Making a Web Scraper to Download Images off the Internet

One afternoon I read on a popular website that http://prnt.sc/ uses sequential six-character codes to host user images. This made me wonder what was on there.

The next day I made a small bot to scrape the website and collect every image in a range; the bot can be run multiple times to collect more images if necessary. I left it running for a couple of hours. I’m sure I cannot re-host the images, but the range I scraped started at gmmlaq and ran for 1,287 images before the bot was IP banned through Cloudflare, which is fair enough. I took the time to view each image individually, and here’s what I managed to find.

Here’s What I Saw

  • A driver’s licence and a matching passport, which was expired.
  • A WordPress username and password combination for a web-host reseller which I did not test.
  • Many, many out-of-context conversations, half of which were in Cyrillic.
  • A teacher seemingly contacting students through Skype and recording the fact they did not pick up.
  • Ominous pictures of a tree posted multiple times.
  • Screenshots of video games, mainly Minecraft, RuneScape, Team Fortress 2 and League of Legends.
  • A lot of backend databases of usernames and email addresses for customers and users; in fact, they are a large proportion of the screenshots.
  • A lot of SEO spam.
  • A conversation between two users through Skype debating whether to ban an influencer from their platform for fake referrals.
  • About 2 lewd photos.
  • A few hotel confirmations.
  • Full credit card details, including the CVV and 16-digit number.
  • A spamvertising campaign CMS platform.
  • A gambling backend database disabling access to games for specific users.
  • One 4×4 pixel image and one 1×47 pixel image.

What Did We Learn?

  • Identifiers like this, particularly URLs, should not be sequential.
  • A lot of users on the platform see the randomness of the URL as sufficient security; however, it’s undermined by the fact the website can be scraped sequentially.
  • They did eventually ban the bot after 1,287 images (probably closer to 1,500 counting the images from earlier testing); however, Cloudflare seems to be the one preventing access, so it may be a service they offer.
  • A lot of users on the platform are web developers and use every trick in the book to boost their numbers.
  • A lot of users are Eastern European and American.
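
To illustrate the first point, a short code can be drawn at random instead of counted up. This is only a minimal sketch and not how prnt.sc or any other host actually works; the function name random_code, the alphabet and the code length are all assumptions made for illustration.

```python
import secrets
import string

# Hypothetical alphabet and length, chosen only for illustration.
ALPHABET = string.ascii_lowercase + string.digits
CODE_LENGTH = 12


def random_code(length=CODE_LENGTH):
    """Return an unpredictable code drawn from a cryptographically secure source."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))


if __name__ == "__main__":
    print(random_code())
    # Number of possible codes someone would have to guess through.
    print(len(ALPHABET) ** CODE_LENGTH)
```

With codes like these, knowing one valid URL tells you nothing about the next one, so a bot cannot simply add one and try again.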

How I Made the Scraper

I made this bot using Python 3.7; however, it may work on older versions. The code in the URL is treated as a base 26 number to match the alphabet (using the same bijective scheme as Excel column names), incremented and then converted back to a string for scraping. Images are saved under their matching codes. I do not condone running the scraper yourself.

import requests
import configparser
import string
from bs4 import BeautifulSoup
from functools import reduce

# Scraper for https://prnt.sc/


# Headers from a chrome web browser used to circumvent bot detection.
headers = {
    "ACCEPT" : "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "ACCEPT-LANGUAGE": "en-US,en;q=0.9",
    "DEVICE-MEMORY": "8",
    "DOWNLINK": "10",
    "DPR": "1",
    "ECT": "4g",
    "HOST": "prnt.sc",
    "REFERER": "https://www.google.com/",
    "RTT": "50",
    "SEC-FETCH-DEST": "document",
    "SEC-FETCH-MODE": "navigate",
    "SEC-FETCH-SITE": "cross-site",
    "SEC-FETCH-USER": "?1",
    "UPGRADE-INSECURE-REQUESTS": "1",
    "USER-AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
    "VIEWPORT-WIDTH": "1920",
}

# https://stackoverflow.com/a/48984697/2697955
def divmod_excel(n):
    a, b = divmod(n, 26)
    if b == 0:
        return a - 1, b + 26
    return a, b


# Converts our '89346963' -> 'gmmlaq'
# https://stackoverflow.com/a/48984697/2697955
def to_excel(num):
    chars = []
    while num > 0:
        num, d = divmod_excel(num)
        chars.append(string.ascii_lowercase[d - 1])
    return ''.join(reversed(chars))

# Converts our 'gmmlaq' -> '89346963'
# https://stackoverflow.com/a/48984697/2697955
def from_excel(chars):
    return reduce(lambda r, x: r * 26 + x + 1, map(string.ascii_lowercase.index, chars), 0)

# Load config or start a new one.
# The default image start is an arbitrary code.
def get_config():
    try:
        config = configparser.ConfigParser()
        with open('config.cfg') as f:
            config.read_file(f)
        return config
    except (OSError, configparser.Error):
        config = configparser.ConfigParser()
        config['Screenshots'] = {'imagestart': 'gmmlaq', 'url': 'https://prnt.sc/', 'iterations': '20'}
        with open('config.cfg', 'w') as configfile:
            config.write(configfile)
        return config

# Save image from url.
def get_image_and_save(website_url, image_url):
    try:
        html_content = requests.get(website_url + image_url, headers=headers).content
        soup = BeautifulSoup(html_content, "lxml")
        #with open('image_name.html', 'wb') as handler:
             #handler.write(html_content)
        ourimageurl = soup.find(id='screenshot-image')['src']
        #print(ourimageurl)
        image = requests.get(ourimageurl).content
        with open(image_url + '.png', 'wb') as handler:
             handler.write(image)
    except Exception:
        # Most failures here mean the page no longer has a screenshot on it.
        print(image_url + " was probably removed.")

def increment_image(image_url):
    return to_excel(from_excel(image_url) + 1)

config = get_config()
print ("Starting at '" + config["Screenshots"]["imagestart"] + "'.")

website_url = config["Screenshots"]["url"]
current_image_url = config["Screenshots"]["imagestart"]
for x in range(0, int(config["Screenshots"]["iterations"])):
    print("Currently downloading image " + current_image_url)
    get_image_and_save(website_url, current_image_url)
    current_image_url = increment_image(current_image_url)

# Set new config code to current location for next run.
config.set('Screenshots', 'imagestart', current_image_url)
with open('config.cfg', 'w') as configfile:
    config.write(configfile)

The bot requires Python 3 along with the requests, BeautifulSoup4 and lxml packages (configparser is part of the standard library). The scraper cannot handle numbers in the URL, so please replace them with letters before picking a starting point. This was an oversight on my part.
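
One possible way around the digits limitation is to treat the code as an ordinary base 36 number. This is a minimal sketch rather than part of the original bot: the helper names decode, encode and increment_code are made up here, and the digits-then-letters ordering is an assumption, since the ordering the site really uses is not known.

```python
import string

# Assumed alphabet: digits first, then lowercase letters (plain base 36).
# The real ordering used by the site is not known here.
ALPHABET = string.digits + string.ascii_lowercase


def decode(code):
    """Convert a code such as 'gmm1aq' into an integer."""
    value = 0
    for char in code:
        value = value * len(ALPHABET) + ALPHABET.index(char)
    return value


def encode(value, width):
    """Convert an integer back into a code, left-padded to a fixed width."""
    chars = []
    while value > 0:
        value, digit = divmod(value, len(ALPHABET))
        chars.append(ALPHABET[digit])
    return "".join(reversed(chars)).rjust(width, ALPHABET[0])


def increment_code(code):
    """Return the next code of the same width."""
    return encode(decode(code) + 1, len(code))
```

Under those assumptions, increment_code could stand in for increment_image in the loop, and a starting code containing digits would no longer need editing by hand.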

Don’t do anything against their terms of service, Aidan.

Scraping Canvas (LMS)

Because my time at university is ending, I thought it best to archive the Canvas pages available to me for later reference, in case I cannot access Canvas later because they change platforms or disable my account. I should probably add that this is for archival purposes and I will not be able to share the data I collected. Thankfully I was able to get the whole thing going in a few minutes, although downloading took a lot longer.

The first snippet, which I got from here, didn’t complete the first time. It seemed some image was causing issues, so I moved to another gist. At this rate we could be done in half an hour 😊.

Unfortunately it also borked out in a similar place:

FileNotFoundError

I think it is because there’s something missing or I don’t have access to it. But the real problem is that it’s downloading content for a course I don’t care about: I was enrolled in it, but it’s full of junk I’m not interested in. We can remove it by using the second scraper’s code and specifying the course IDs, which I had to go through manually. There were about 15 of them but it didn’t take too long. That gave me the full command:

F:\Downloads\canvas>python canvas.py https://canvas.hull.ac.uk/ 4738~DUI9Nha9weSuemu1M2qsmhljoBcQtR0zghXTs3QA7ECHDHQkpsgBQ9RllbaEwySf output 52497,56148,56149,52493,54499,54452,54456,53441,52496,22257,22274,22276,22277,22278,22279,22280,50664,50656,22275,50652

The access token you can see above should be expired by now. You can do it yourself by downloading the same file and installing Python 3, pathvalidate and pycanvas. You need to generate an access token from /profile/settings, and you can get the course ID by clicking on the course, which takes you to a URL like /courses/56149. When you generate a new token you should receive an email about it.

Canvas online with our starred modules displayed.

I made a small adaptation to catch the FileNotFoundError and we were off to the races. It took over an hour, so I decided it was best to leave it running overnight. When I returned in the morning I had 116 errors (failed downloads) and the rest was the course content!
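
The adaptation amounts to wrapping each download in a try/except so that one missing file does not stop the whole run. The snippet below is a minimal sketch of that idea and not the gist’s actual code; download_all and the download_item callable are hypothetical names used only for illustration.

```python
def download_all(items, download_item):
    """Try every item, collecting failures instead of stopping at the first one.

    `download_item` is a hypothetical callable that saves one file and may
    raise FileNotFoundError when the file is missing or inaccessible.
    """
    failed = []
    for item in items:
        try:
            download_item(item)
        except FileNotFoundError as error:
            # Record the failure and carry on with the next file.
            failed.append((item, str(error)))
    return failed


if __name__ == "__main__":
    def fake_download(name):
        # Stand-in for a real download; fails for one made-up file name.
        if name == "missing.png":
            raise FileNotFoundError(name)

    errors = download_all(["notes.pdf", "missing.png", "slides.pptx"], fake_download)
    print(f"{len(errors)} failed download(s): {errors}")
```

Keeping the failures in a list makes it easy to count them at the end or retry them later.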

Our Canvas Modules saved to Windows File Explorer.

Unfortunately I don’t seem to have the submissions for each of these courses, so I needed to download them manually as well, and then our archive was complete.

Thanks for reading.

How to remove ‘Google’ from the Gboard Spacebar

A recent update to Google’s keyboard, Gboard, has added the word ‘Google’ to the spacebar at the bottom of the keyboard. I personally didn’t like this addition to the app and, at present, I cannot find a way to disable it in the latest version.

There is, however, a solution to this problem: you can roll back your Gboard app to stock and not update it again. Beware that doing so will reset your keyboard settings (such as the theme).

To do this:

  1. Open the Play Store app.
  2. Swipe in from the left and choose ‘Settings’.
  3. Select ‘Auto-update apps’ and choose ‘Don’t auto-update apps’.
  4. Select Done.
  5. Go back to the Play Store main page.
  6. Search for ‘Gboard’ and select ‘Gboard – the Google Keyboard’.
  7. Then select ‘Uninstall’ to roll it back to stock.
  8. Voilà. You shouldn’t have ‘Google’ on your spacebar anymore.

Unless you have a custom ROM, your app should go back to the factory version that came with the phone. If for some reason that uninstalls the keyboard completely, you may wish to download an older build from a reputable source. I can confirm that the version I am currently using is 8.3.6.250752527-release-arm64-v8a26830614 and the word ‘Google’ is not present on my OnePlus 3, but I imagine the releases are hardly innovative, as keyboards tend to go.

That should be all you need to remove the word ‘Google’ from your Android keyboard spacebar. You could also install another keyboard if you particularly wanted to.

Early High-Level Programming Languages

In the 1950s and 1960s, there was a great deal of innovation in the field of computer programming and design. Computers were becoming commercially available and starting to gain widespread interest. For example, in 1951 the UNIVAC I became one of the first commercially available computers.

FORTRAN

FORTRAN arrived in 1957 and is considered one of the first high-level languages to really gain popularity. It was designed for high performance, and its compiler could optimize code to improve the performance of programmers’ instructions. The name is short for ‘Formula Translation’, and its success saw it spread to other computers early on.

Fortran was built for number crunching and computing. Its implementations were widespread and its general-purpose capabilities saw use in many scientific fields of research. It was released over a series of years in different versions, in many cases with compatibility for previous versions. By modern-day standards it is considered low level, but it was no doubt formative for other modern-day languages: it included features like code comments, input-output handling and one of the first do loops. FORTRAN has many versions and is still used today.

ALGOL

ALGOL was developed around the same time as Fortran and was designed for more ‘algorithmic’ purposes. ALGOL 58, originally named IAL (International Algebraic Language), was considered a prototype and was soon superseded by ALGOL 60. It was designed to be more human-readable and could be used to describe algorithms. Unlike Fortran, it was not tied to specific hardware for maximum speed but relied on whatever implementation its designers thought most suitable. Although it was not as popular as Fortran, many features of modern languages appeared in ALGOL first, such as IF ELSE statements and dynamic arrays sized at run time.

Stored Program Computers

A stored-program architecture such as the Von Neumann architecture keeps programs in the computer’s memory and, unless an interpreted or JIT-compiled language is used, a program’s instructions and its data can be treated in the same way. This made programming a lot easier for earlier generations of computers, as it meant the computer could be programmed using punch tape or cards.

The first ‘fully’ stored-program computer was the Manchester Mark 1, which was first operational in April 1949.

https://en.wikipedia.org/wiki/Manchester_Mark_1
However there is some dispute as to the true ‘first’ stored program computer.
https://en.wikipedia.org/wiki/Stored-program_computer

Before the use of punch cards or tape, computers were programmed in a similar way using wires. This meant a lot of re-wiring, and ‘patching’ was difficult on complex systems, as the wires would span whole rows of machinery and in some cases a program could take many miles of wire to complete.

In addition to the ‘Von Neumann’ architecture, there is also the ‘Harvard’ architecture, which keeps data memory and program memory independent.

Programming Order of Succession

  • Early computers were not re-programmable. They were hard-wired.
  • Then punch tape and punch cards were developed to feed into computer memory to be computed.
  • Then programs started to use machine code. Although complex for a human to write, it was one of the first innovations that allowed for easy computer programming and rapid development.
  • After machine code, a new symbolic form of machine code, assembly, was created whereby complex hardware instructions could be reduced to line-by-line instructions. Hence the first assembler was created in order to turn assembly code into machine code.

Machine Code

Machine code could be considered a modern-day programmer’s lowest level of access to a computer’s processor. It provides basic logical or mathematical instructions, along with instructions to store, move or load data. It is possible on modern hardware to virtualize machine code, and some modern programming languages like Java compile programs into byte code so that the same program can run on many platforms.

Assembly

Due to the complexity of machine code, the need for a human-readable language (which later developed into high-level programming languages) birthed assembly, a comparatively easy way to program the computer using symbolic instructions designed for human readability. Instead of numeric opcodes, the new syntax allowed for easily identifiable instructions (MOVL, JMP, ADDL). Embedded software and real-time systems may still use assembly as their primary source code today.
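
As a rough illustration of what an assembler does, the toy below swaps readable mnemonics for numbers. It is a minimal sketch only: the opcode table and the imaginary machine are made up for this example and do not match any real instruction set.

```python
# Made-up opcode table for an imaginary machine; real instruction sets differ.
OPCODES = {"LOAD": 0x01, "ADD": 0x02, "STORE": 0x03, "JMP": 0x04, "HALT": 0xFF}


def assemble(source):
    """Translate lines like 'ADD 11' into a flat list of numeric bytes."""
    machine_code = []
    for line in source.splitlines():
        line = line.split(";")[0].strip()  # drop comments and whitespace
        if not line:
            continue
        mnemonic, *operands = line.split()
        machine_code.append(OPCODES[mnemonic.upper()])
        machine_code.extend(int(operand) for operand in operands)
    return machine_code


program = """
LOAD 10    ; put the value at address 10 in the accumulator
ADD 11     ; add the value at address 11
STORE 12   ; write the result to address 12
HALT
"""

print(assemble(program))  # prints [1, 10, 2, 11, 3, 12, 255]
```

The numeric list is what the machine would actually read; the mnemonics exist purely for the person writing the program.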

Interpreted Languages

As high-level languages (the ‘capslock’ languages) were adopted, interpreted languages were developed (from the 1950s onward). An interpreter uses the computer to process a program’s instructions as a ‘virtual machine’ (in the literal sense), which aids in porting the language to other computers: the language syntax can be a defined standard, and an interpreter or compiler can be written for many models of computer with different machine code. Interpreted languages have the added benefit of allowing the programmer to debug their program at a more granular level, as programs can be inspected line by line as each line is translated and executed.

Byte code such as Java byte code can be interpreted or compiled just in time (JIT), where the byte code is translated into machine code while the program is running. However, this additional workload takes a performance toll on the program.
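
To give a feel for what interpreting byte code involves, here is a toy stack machine. It is a minimal sketch only: the instruction names are invented for this example and it is not Java byte code or any real virtual machine.

```python
def run(bytecode):
    """Interpret a list of (instruction, argument) pairs on a small stack machine."""
    stack = []
    for instruction, argument in bytecode:
        if instruction == "PUSH":
            stack.append(argument)
        elif instruction == "ADD":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif instruction == "MUL":
            right, left = stack.pop(), stack.pop()
            stack.append(left * right)
        else:
            raise ValueError(f"unknown instruction: {instruction}")
    return stack.pop()


# (2 + 3) * 4 expressed as made-up byte code.
program = [("PUSH", 2), ("PUSH", 3), ("ADD", None), ("PUSH", 4), ("MUL", None)]
print(run(program))  # prints 20
```

An interpreter repeats this kind of dispatch every time an instruction runs, whereas a JIT compiler would translate the same instructions into machine code as the program runs.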

High Level Programming Languages

High-level programming languages reduce the learning curve and the frustration that comes with debugging a program, as the syntax and grammar of the language are much easier for the reader to understand because they are based on a more natural approach to human interpretation. However, this convenience can come at a cost: programs may be compiled in a way that does not optimize performance or fully utilize capacity, because the compiler may translate the code in a way the programmer did not intend or was not aware of.

However, the development of high-level programming languages allowed for much more rapid development than their predecessors, which meant that in the 1950s the performance cost to compiled programs was outweighed by the gain in development speed (programmers were able to do their job more easily, which made the programs better).

Example Early High Level Programming Languages

  • FORTRAN (Formula Translation)
  • COBOL (Common Business Oriented Language)
  • ALGOL (Algorithmic Language)
  • LISP (List Processing)
  • BASIC (Beginner’s All-purpose Symbolic Instruction Code)

Sometimes called the Capslock Programming Languages.

Programming languages can be classified by their approach to programming paradigms, such as procedural, functional or object-oriented. Most modern-day programming languages are fit for general purpose.

How Much Does My Car Cost Per Mile?

I thought it’s time I put my GnuCash data to good use and worked out how much I spend on fuel. I loaded up a simple Cash Flow bar chart in GnuCash and selected my expenses column for Car>Petrol and voilà.

Monthly petrol costs since 2017, highest at £167, lowest £10. Apparently I did not buy any petrol in July – Also checked this. Total expenses from 01/01/2017-01/01/2020 £2,659.79.

If you are astute you may have noticed there are additional ‘fixed’ costs to running a car (they vary annually or monthly), such as tax, insurance, maintenance and depreciation. I have chosen to leave these out because I would like to explore the benefits of buying another car. Based on the last 3 years of this graph, my 1998-1999 Vauxhall Corsa costs £73.88 per month in fuel (excluding this month from the data).

My Car

So how far does £74/month get you?

Good question. I’ve always used my car when I’ve needed it, and I have little reason not to use it. I’ve used it to commute to university, school and work, and on days out. It’s my main mode of transport, is what I’m trying to say.

Unfortunately, I don’t keep a history of my car’s odometer; however, I can use the MOT history of my car to estimate the cost per mile. The MOT history records the mileage on the dates my car was taken in for its MOT. Using two of those dates, one in 2017 and one in 2019, we can determine that in that time I did 19,205 miles, so roughly 6.66k miles a year.

There are around 1 year and 11 months between those two MOT dates, so at £74 a month we would expect (74*23) £1,702 of expenditure during that time. The actual data gives £2,051.23, a difference of over £300 (21%), which gives us fairly low confidence. However, we can still use it to estimate that my car’s cost per mile on fuel alone is around £0.107 (2051.23/19205), or roughly 10p per mile. At that rate £74 buys about 740 miles per month, give or take 20%. That’s 24 miles per day!
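
The arithmetic above can be checked with a few lines of Python. This is just a sketch using the figures quoted in this post; the variable names are made up, and the 10p per mile figure is the rounded shorthand rather than the exact result.

```python
# Figures taken from the post above.
fuel_spend = 2051.23      # £ spent on petrol between the two MOT dates
miles_driven = 19205      # miles recorded between the two MOT dates
monthly_budget = 74       # £ per month, rounded from £73.88

cost_per_mile = fuel_spend / miles_driven
rounded_cost_per_mile = 0.10  # the "10p per mile" shorthand

miles_per_month = monthly_budget / rounded_cost_per_mile
miles_per_day = miles_per_month * 12 / 365

print(f"Cost per mile: £{cost_per_mile:.3f}")
print(f"Miles per month at 10p per mile: {miles_per_month:.0f}")
print(f"Miles per day: {miles_per_day:.1f}")
```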

Inside a Western Digital Blue Hard Drive

I thought I’d share pictures I took when I took apart a 250GB dead hard drive.

Rest in pieces my WD2500AAKX

I got this hard drive as part of a Dell Optiplex 780 and used it as a server for my internal network. It worked great until it wouldn’t boot. I checked on it and, sure enough, it was stuck in Ubuntu Server boot recovery. I tried to recover it but I think I did more harm than good. I moved it to a Windows computer and tried to recover the data with Recuva, which didn’t do anything because it couldn’t pick up the disk. I then moved to TestDisk, which was able to see the drive and partitions but never got past profiling the disk. So then I decided to take it apart.

The hard drive in the Dell Optiplex 780 covered in dust
The hard drive in the Dell Optiplex 780

First I unscrewed all the screws. There is another screw, holding the read/write head, hidden under the label.

Hard drive and hard drive mainboard
Front of hard drive and hard drive mainboard

After that I took it apart a little more. It has one platter internally and one big old magnet, which I kept.

WD2500AAKX internals with platter and read write head exposed
Well, I’ve let the magic smoke out now.

Interestingly, there seems to be a metal piece on the bottom and side of the hard drive which I think is for easy destruction. CrystalDiskInfo said it had 29202 hours on it and 2875 power-ons, nearly exactly the same as my ST2000DM001-1CH164 T2B hard drive. The SMART data also had warnings for its Reallocated Sectors Count.

It’s in the bin now. Thanks for reading.

Customer Focus in Business

Understanding that a customer has needs when using a good or service allows a business to identify marketable opportunities for increasing profitability or maximizing revenues. A large portion of customer focus is about communication.


Small Breakdown of developing customer focus.

A customer-focused approach can be adopted by many aspects of a business, such as:

  • Sales
  • Management
  • Location
  • Customer Service
  • Marketing
  • Growth and Extensibility

Many customers will have different needs and goals, and there are many aspects of a business that may need to change to adopt a customer-first approach, but the payoff is:

  • High Customer Retention
  • Long Term Commitment
  • Greater Profitability
  • Greater Customer Satisfaction

However, adopting such an approach may also have some negative business consequences:

  • Increased Spending
  • Increased Overheads
  • Increased After-Sales spending
  • Immediate responses and on-site negotiation
  • Lower Profitability
  • Harder Automation or lack thereof

Providing a Customer-Focused Approach to Sales

Giving the customer what they want is paramount to ensuring a customer-focused approach. Customers usually appreciate a hands-off approach to getting things done and are usually willing to pay extra for it. Providing a service that is better than the competition, or providing greater pre-sales support (for example, through online chat or in-person representation), allows the business to increase its potential to close a sale and gives the customer greater satisfaction in their choice.

There are many ways to provide a custom approach to sales:

  • Offer a product that is superior to competition – If your business is able to deliver a product better than the rest, you can capitalize on its potential to increase the customer’s satisfaction.
  • Use Marketing that drives the customer toward package solutions – providing a complete service, rather than a means to an end will allow for greater satisfaction, and as a by-product greater opportunities for increased added value.
  • Guide the customer – Inform the customer of any regulations or licensing that they may need, arrange to set that up for them as part of the service.
  • Offer tertiary products that complement their purchase.
  • Provide Pre-Sales service to ensure the customer is satisfied through demonstration or information.
  • Understand the customer’s stated clear needs and objectives to provide a product they would be satisfied with.
  • Know when the customer is ready to talk, and when they aren’t.
  • Exceed the customer’s expectations.
  • Develop relationships that the customer values.
  • Offer solutions to suit the needs and concerns that the customer may have before purchase.
  • Don’t be passive. Engage with the customer.
  • Ensure customers receive what they ask for and gauge success.

Offering a way for the customer to reflect their satisfaction, through survey or metrics will allow a business to identify where they achieve, exceed or disappoint the expectations of the customer.

Providing a Customer-Focused Approach to Management

A large part of customer focus for management is providing proper training for staff to fulfill the needs of the customer above and beyond their expectations:

  • Management shouldn’t be a roadblock between the customer and the sales staff. Provide a framework that can be followed such as a budget or develop routine customer stories.
  • Provide training to ensure the sales staff know what isn’t allowed.
  • Use appropriate means of communication; don’t push for sales.
  • Have measures in place to prevent abuse. A case study about Continental found that, on average, the lowest value customers whose flights were delayed were receiving the highest compensation.
  • Know your market segment and the needs of the customers. If a customer does not care about your values as a business, you need to change to be competitive.
  • Provide staff with a view to the customers’ interests and an incentive to stick to it.
  • What is the best way to collect customers’ responses and respond to issues?
  • Coordinate your teams as a group with clear ground rules and goals, but don’t alienate the customer.

Providing a Customer-Focused Approach to Location

When a customer wants a product or service, they may be willing to pay more than the going rate for convenience, more so now given the new market for app-based food deliveries and same-day online shopping. Having the customer see your storefront when they need to is a perfect situation for both parties.

Many businesses also opt to help the local community and sponsor community projects.

My Echo Dot broke and I’m kind of mad about It

I got an Alexa in 2016 as a birthday present and I used the thing almost daily! We ended up with 3 in the house, one in the kitchen and two in different bedrooms. I also installed a Sonoff smart switch in the ceiling light in my bedroom, which meant I could easily turn off the light right from the comfort of my bed (using sonoff-tasmota). It was great until one day my £50, 4-year-old, 2nd generation Echo Dot stopped working.

I went into my room and said ‘Alexa, turn on the light’ and was met with blunt silence. I looked over at my Echo Dot and it was showing its blue “I’m working, leave me be” light with a single white light rotating around it. I walked over to it and after a considerable number of minutes (7 or 10) I decided it had been this way for longer than it should have been and was stuck in a boot loop or something, so I opted for a switch-it-off-and-on-again approach. It booted into its solid blue bootloader, and then sat there spinning its blue and white light again in silence.

I left it for hours to no avail. I tried resetting it by holding down its mute and volume down buttons, and nothing. I tried all manner of button-pressing combinations: uber and volume down with mute, all at once, one after another. I gave up after about an hour of pressing and pushing buttons on it and decided that it was a software issue and that it would not wake up from its blue spinning trance.

I searched online and couldn’t find anyone with the same issue. There seem to be a few people who used the wrong power cord, but I’m not one of them, and I tried a new lead and power brick anyway. I posted on Amazon’s digital devices forum and was met with standard troubleshooting which I followed with optimism but which, as I expected, did not work.

I also briefly looked at whether I could re-image the Echo using fastboot. I knew it ran Android, but after reading online and trying for myself (uber and USB to PC, Echo showing a green light) I saw that the Echo is fairly locked down and cannot be accessed this way.

Which is where we come to why I’m mad. It’s for two reasons. The first-generation Echo Dot had a separate physical reset button which, if I were able to use it on the second-generation Echo Dot, I’m sure would allow me to hard reset it. But because the thing is stuck in some software upgrade or something, it’s only good as a bad police light, a paperweight, landfill.

The second reason I’m mad is that I’m not allowed to fix this with another PC. I’ve only used ADB and Fastboot a few times, but they have allowed me to extend the lifetime of my devices and customise them to my heart’s desire.

You let me down, Amazon. Now I have to get up and turn off the light like everyone else.

My Echo Dot 2nd Generation