Making a Web Scraper to Download Images off the Internet

One afternoon I read on a popular website that http://prnt.sc/ uses sequential six-character codes to host user images, which made me wonder what was on there.

The next day I made a small bot to scrape the website and collect all images through a range; it can be run multiple times to collect more images if necessary. I left the bot running for a couple of hours, and here's what I managed to find. I'm fairly sure I cannot re-host the images, but the range I scraped started at gmmlaq and covered 1,287 images before the bot was IP banned through Cloudflare; fair enough. I took the time to view each image individually.

Here’s What I Saw

  • A driver's licence and a matching passport, which was expired.
  • A WordPress username and password combination for a web-host reseller, which I did not test.
  • Many, many out-of-context conversations, half of which were in Cyrillic.
  • A teacher seemingly contacting students over Skype and recording the fact that they did not pick up.
  • Ominous pictures of a tree, posted multiple times.
  • Screenshots of video games, mainly Minecraft, RuneScape, Team Fortress 2 and League of Legends.
  • A lot of backend databases of usernames and email addresses for customers and users; in fact, these make up a large proportion of the screenshots.
  • A lot of SEO spam.
  • A Skype conversation between two users debating whether to ban an influencer from their platform for fake referrals.
  • About two lewd photos.
  • A few hotel confirmations.
  • Complete credit card details, including the CVV and the full 16-digit number.
  • A spamvertising campaign CMS platform.
  • A gambling backend database disabling specific users' access to games.
  • One 4×4-pixel image and one 1×47-pixel image.

What Did We Learn?

  • Identifiers like this, particularly URLs, should not be sequential (see the sketch after this list).
  • A lot of users on the platform treat the apparent randomness of the URL as sufficient security; however, it's undermined by the fact that the website can be scraped sequentially.
  • They did eventually ban the bot after 1,287 images (probably closer to 1,500 counting earlier testing); however, Cloudflare seems to be what is blocking access, so the rate limiting may be a service it provides.
  • A lot of users on the platform are web developers and use every trick in the book to boost their numbers.
  • A lot of users are Eastern European and American.
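
For anyone building something similar, the fix is cheap. Below is a minimal sketch of the alternative: drawing codes from a cryptographically secure random source instead of a counter. This is purely illustrative and not prnt.sc's actual scheme.

import secrets
import string

# Hypothetical alphabet and length, chosen for illustration only.
ALPHABET = string.ascii_lowercase + string.digits

def random_code(length=8):
    # Each character comes from a CSPRNG, so neighbouring uploads share
    # nothing; 36^8 is roughly 2.8 trillion possible codes, making
    # sequential scraping like mine useless.
    return ''.join(secrets.choice(ALPHABET) for _ in range(length))

print(random_code())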

How I Made the Scraper

I made this bot using Python 3.7, though it may work on older versions. The six-letter code is treated as a bijective base-26 number (the same scheme as Excel column names), incremented as an integer, and then converted back to a string for scraping. Images are saved under their code names. I do not condone running the scraper yourself.

import requests
import configparser
import string
from bs4 import BeautifulSoup
from functools import reduce

# Scraper for https://prnt.sc/


# Headers from a Chrome web browser, used to circumvent bot detection.
headers = {
    "ACCEPT" : "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "ACCEPT-LANGUAGE": "en-US,en;q=0.9",
    "DEVICE-MEMORY": "8",
    "DOWNLINK": "10",
    "DPR": "1",
    "ECT": "4g",
    "HOST": "prnt.sc",
    "REFERER": "https://www.google.com/",
    "RTT": "50",
    "SEC-FETCH-DEST": "document",
    "SEC-FETCH-MODE": "navigate",
    "SEC-FETCH-SITE": "cross-site",
    "SEC-FETCH-USER": "?1",
    "UPGRADE-INSECURE-REQUESTS": "1",
    "USER-AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
    "VIEWPORT-WIDTH": "1920",
}

# divmod for bijective base-26 ("Excel column") numbering.
# https://stackoverflow.com/a/48984697/2697955
def divmod_excel(n):
    a, b = divmod(n, 26)
    if b == 0:
        return a - 1, b + 26
    return a, b


# Converts our '89346963' -> 'gmmlaq'
# https://stackoverflow.com/a/48984697/2697955
def to_excel(num):
    chars = []
    while num > 0:
        num, d = divmod_excel(num)
        chars.append(string.ascii_lowercase[d - 1])
    return ''.join(reversed(chars))

# Converts our 'gmmlaq' -> '89346963'
# https://stackoverflow.com/a/48984697/2697955
def from_excel(chars):
    return reduce(lambda r, x: r * 26 + x + 1, map(string.ascii_lowercase.index, chars), 0)

# Load the config, or create a default one starting at 'gmmlaq'.
def get_config():
    config = configparser.ConfigParser()
    try:
        with open('config.cfg') as f:
            config.read_file(f)
    except FileNotFoundError:
        config['Screenshots'] = {'imagestart': 'gmmlaq', 'url': 'https://prnt.sc/', 'iterations': '20'}
        with open('config.cfg', 'w') as configfile:
            config.write(configfile)
    return config

# Fetch the page for one code and save the screenshot it hosts.
def get_image_and_save(website_url, image_url):
    try:
        html_content = requests.get(website_url + image_url, headers=headers).content
        soup = BeautifulSoup(html_content, "lxml")
        # The screenshot itself lives in the <img id="screenshot-image"> tag.
        ourimageurl = soup.find(id='screenshot-image')['src']
        image = requests.get(ourimageurl).content
        with open(image_url + '.png', 'wb') as handler:
            handler.write(image)
    except Exception:
        print(image_url + " was probably removed.")

# Step to the next sequential code.
def increment_image(image_url):
    return to_excel(from_excel(image_url) + 1)

config = get_config()
print ("Starting at '" + config["Screenshots"]["imagestart"] + "'.")

website_url = config["Screenshots"]["url"]
current_image_url = config["Screenshots"]["imagestart"]
for _ in range(int(config["Screenshots"]["iterations"])):
    print("Currently downloading image " + current_image_url)
    get_image_and_save(website_url, current_image_url)
    current_image_url = increment_image(current_image_url)

# Set new config code to current location for next run.
config.set('Screenshots', 'imagestart', current_image_url)
with open('config.cfg', 'w') as configfile:
    config.write(configfile)

The bot requires Python 3 with the requests, BeautifulSoup4 and lxml packages installed (configparser is part of the standard library). The scraper cannot handle numbers in the codes, so replace them with letters before picking a starting point; this was an oversight on my part. A possible fix is sketched below.
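
If I were to fix that oversight, extending the helpers above to bijective base 36 would probably do it. This is an untested sketch, and the alphabet ordering ('a'-'z' before '0'-'9') is an assumption; I have not checked how prnt.sc actually orders codes containing digits.

import string
from functools import reduce

# Assumed ordering: 'a'-'z' first, then '0'-'9'. This is a guess, not
# verified against real prnt.sc codes.
ALPHABET36 = string.ascii_lowercase + string.digits

def from_code36(chars):
    # 'a' -> 1, 'z' -> 26, '0' -> 27, ..., '9' -> 36, 'aa' -> 37, etc.
    return reduce(lambda r, x: r * 36 + x + 1, map(ALPHABET36.index, chars), 0)

def to_code36(num):
    chars = []
    while num > 0:
        num, d = divmod(num, 36)
        if d == 0:
            num, d = num - 1, 36
        chars.append(ALPHABET36[d - 1])
    return ''.join(reversed(chars))

# Incrementing now carries across digits, e.g. 'a9' -> 'ba'.
print(to_code36(from_code36('a9') + 1))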

Don’t do anything against their terms of service, Aidan.

The Cheap Raspberry Pi Security Camera

One of the great things about the Raspberry Pi is the community that works to create really great projects. I have set up a Raspberry Pi B looking out my windows. It faces the front door, so it can see anyone coming down the street and toward the door. I had a couple of cheap $2 webcams lying around, so I set them up looking out the windows. The total cost of the entire setup is about $10, excluding the cost of the Pi itself. I also think the Pi is a little underpowered for the task: occasionally it stops working after several weeks; the camera still records video, but the web interface has to be reloaded to get things working again.

Overall, I'd say this project was ineffective for its purpose: unfortunately, the cameras would not record movement accurately enough and would sometimes record several hours of minimal movement. An IP camera would likely be more cost-effective and better suited to the task; the Pi I used was simply underpowered for monitoring two webcams and crashed after about two weeks of running continuously. Viewing the files showed that although it captured movement and video, some recordings were corrupt, glitchy, or contained only about three usable frames. It did, however, provide a good live view of what was going on outside the house, with about a four-second delay.

Overall, I would say that a Raspberry Pi as a cheap webcam security camera is not a good idea, the main contributing factor being that it could not keep the program running and often saved garbage to the SD card. If I were to do this again, I definitely would not use two cameras, as the setup simply did not work well enough to actually increase security: it would often record trees moving for hours, and sometimes one camera would freeze up entirely.
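
For context, here is a minimal sketch of the frame-differencing approach this kind of motion software uses (not the code I actually ran, and OpenCV here is my substitution). It shows the trade-off: set the threshold low and swaying trees record for hours; set it high and you miss real events.

import cv2

# Number of changed pixels needed before we call it motion; tune upward
# to ignore small movement such as foliage.
CHANGED_PIXELS_THRESHOLD = 5000

cap = cv2.VideoCapture(0)  # first attached webcam
ok, frame = cap.read()
if not ok:
    raise SystemExit("no camera found")
previous = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)

while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Count pixels that changed noticeably since the previous frame.
    delta = cv2.absdiff(previous, gray)
    _, mask = cv2.threshold(delta, 25, 255, cv2.THRESH_BINARY)
    if cv2.countNonZero(mask) > CHANGED_PIXELS_THRESHOLD:
        print("motion detected")  # a real setup would start recording here
    previous = gray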

Six Things ICT Provides

ICT is used globally and provides a number of useful capabilities.

Fast Repetitive Processing allows companies and individuals to process large quantities of data quickly and all at once. Repetitive tasks let people produce personalised, tailored reports and information; things like bank statements can be processed rapidly overnight, at quiet points in the day. New technologies mean that complex calculations can be performed quickly and effectively.

Vast Storage Capacity means that IT systems can store larger quantities of data in smaller form factors. Large businesses are able to store huge amounts of data, and programs such as virtual machines and hypervisors are able to actively process more of it. It has also allowed services to store users' files for free, supported by advertising.

Improved search facilities allow people to look up information and files instantly and obtain key information effectively, such as facts and simple calculations through Google and other search engines. They also allow data to be collated, such as viewing reports and files over a period of time or those created by a specific editor. Doctors, for example, are able to look up a patient's details on a computer instantly rather than going through paperwork, and they can see only what is necessary, which allows for additional security and customisation of reports.

Improved presentation of data allows for tailored reports, statistics, graphs and files. Customisation has allowed businesses to tailor their products and software to their customers based on the data obtained. For example, a coffee chain is able to identify its best-selling drinks from sales data and survey information, and it can combine independent data sets that would otherwise stay separate, such as weather and sales. It also makes information easy to understand for someone without deep knowledge of a topic.

Improved accessibility means that information is available in a variety of formats and that people can access their data anywhere in the world. Previously, data was only accessible at one location, but by connecting to the internet, consumers and businesses can take advantage of fast connections to overcome logistical problems such as storage capacity and even uptime through online hosting. People are also able to share single peripherals, such as a printer or scanner, across multiple devices, and programs are easier to use and more accessible to people with special needs.

Improved security has meant that files and programs are safe from third-party prying eyes in transit and from nefarious use; data can be made almost impossible to access without the right credentials and can be stored in ways that cannot be decrypted remotely. It has also allowed networks to be opened to multiple tiers of people, from guests to teachers: access is available only to those who need it, and usability is not compromised.