Making a Web Scraper to Download Images off the Internet

One afternoon I read on a popular website that http://prnt.sc/ uses sequential six-character codes to host user images. That made me wonder what was on there.

The next day I made a small bot to scrape the website and collect all the images across a range; the bot could then be run multiple times to collect more images if necessary. I left it running for a couple of hours, and here's what I managed to find. I'm sure I cannot re-host the images, but the range I scraped started at gmmlaq and covered 1,287 images before the bot was IP banned through Cloudflare, which is fair enough. I took the time to view each image individually.

Here’s What I Saw

  • A driver's licence and a matching passport, which was expired.
  • A WordPress username and password combination for a web-host reseller which I did not test.
  • Many, many out-of-context conversations, half of which were in Cyrillic.
  • A teacher seemingly contacting students over Skype and recording the fact that they did not pick up.
  • Ominous pictures of a tree posted multiple times.
  • Screenshots of video games, mainly Minecraft, RuneScape, Team Fortress 2 and League of Legends.
  • A lot of backend databases of usernames and email addresses for customers and users; in fact, these make up a large proportion of the screenshots.
  • A lot of SEO spam.
  • A conversation over Skype between two users debating whether to ban an influencer from their platform for fake referrals.
  • About 2 lewd photos.
  • A few hotel confirmations.
  • Complete credit card information, including the CVV and the full 16-digit number.
  • A spamvertising campaign CMS platform.
  • A gambling backend database disabling access to games for specific users.
  • One 4×4 pixel image and one 1×47 pixel image.

What Did We Learn?

  • Identifiers like these, particularly URLs, should not be sequential.
  • A lot of users on the platform see the randomness of the URL as sufficient security; however, it's undermined by the fact that the website can be scraped sequentially.
  • They did eventually ban the bot after 1,287 images (probably closer to 1,500 including earlier testing). However, Cloudflare seems to be what prevented access, so this may be a service it offers.
  • A lot of users on the platform are web developers and use every trick in the book to boost their numbers.
  • A lot of users are Eastern European and American.

How I Made the Scraper

I made this bot using Python 3.7, though it may work on older versions. The URL code is treated as a base-26 number: it is decoded, incremented, and converted back to a string for scraping. Images are saved under their code names. I do not condone running the scraper yourself.

import requests
import configparser
import string
from bs4 import BeautifulSoup
from functools import reduce

# Scraper for https://prnt.sc/
# Headers from a Chrome web browser, used to circumvent bot detection.
headers = {
    "ACCEPT": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9",
    "ACCEPT-LANGUAGE": "en-US,en;q=0.9",
    "DEVICE-MEMORY": "8",
    "DOWNLINK": "10",
    "DPR": "1",
    "ECT": "4g",
    "HOST": "prnt.sc",
    "REFERER": "https://www.google.com/",
    "RTT": "50",
    "SEC-FETCH-DEST": "document",
    "SEC-FETCH-MODE": "navigate",
    "SEC-FETCH-SITE": "cross-site",
    "SEC-FETCH-USER": "?1",
    "UPGRADE-INSECURE-REQUESTS": "1",
    "USER-AGENT": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.138 Safari/537.36",
    "VIEWPORT-WIDTH": "1920",
}

# https://stackoverflow.com/a/48984697/2697955
def divmod_excel(n):
    a, b = divmod(n, 26)
    if b == 0:
        return a - 1, b + 26
    return a, b

# Converts our 89346963 -> 'gmmlaq'
# https://stackoverflow.com/a/48984697/2697955
def to_excel(num):
    chars = []
    while num > 0:
        num, d = divmod_excel(num)
        chars.append(string.ascii_lowercase[d - 1])
    return ''.join(reversed(chars))

# Converts our 'gmmlaq' -> 89346963
# https://stackoverflow.com/a/48984697/2697955
def from_excel(chars):
    return reduce(lambda r, x: r * 26 + x + 1, map(string.ascii_lowercase.index, chars), 0)

# Load the config, or create a new one with a starting image code.
def get_config():
    try:
        config = configparser.ConfigParser()
        with open('config.cfg') as f:
            config.read_file(f)
        return config
    except OSError:
        config = configparser.ConfigParser()
        config['Screenshots'] = {'imagestart': 'gmmlaq', 'url': 'https://prnt.sc/', 'iterations': '20'}
        with open('config.cfg', 'w') as configfile:
            config.write(configfile)
        return config

# Fetch the page for an image code and save the screenshot it hosts.
def get_image_and_save(website_url, image_url):
    try:
        html_content = requests.get(website_url + image_url, headers=headers).content
        soup = BeautifulSoup(html_content, "lxml")
        ourimageurl = soup.find(id='screenshot-image')['src']
        image = requests.get(ourimageurl).content
        with open(image_url + '.png', 'wb') as handler:
            handler.write(image)
    except Exception:
        print(image_url + " was probably removed.")

def increment_image(image_url):
    return to_excel(from_excel(image_url) + 1)

config = get_config()
print("Starting at '" + config["Screenshots"]["imagestart"] + "'.")
website_url = config["Screenshots"]["url"]
current_image_url = config["Screenshots"]["imagestart"]
for x in range(0, int(config["Screenshots"]["iterations"])):
    print("Currently downloading image " + current_image_url)
    get_image_and_save(website_url, current_image_url)
    current_image_url = increment_image(current_image_url)

# Save the new start code so the next run picks up where we left off.
config.set('Screenshots', 'imagestart', current_image_url)
with open('config.cfg', 'w') as configfile:
    config.write(configfile)

The bot requires Python with requests and BeautifulSoup4 (configparser is part of the standard library). The scraper cannot handle numbers in the URL, so please replace them with letters before picking a starting point; this was an oversight on my part.
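One possible fix for that oversight, sketched below, is to extend the codec from bijective base 26 to bijective base 36 so the alphabet includes digits. The alphabet ordering here is an assumption on my part; prnt.sc's real code sequence may differ, so treat this as illustrative only.

```python
import string

# Hypothetical base-36 codec: lowercase letters followed by digits.
# Bijective numbering, mirroring the from_excel/to_excel pair above.
ALPHABET = string.ascii_lowercase + string.digits

def from_code(chars):
    # 'a' -> 1, 'b' -> 2, ..., so no code maps to zero.
    n = 0
    for c in chars:
        n = n * len(ALPHABET) + ALPHABET.index(c) + 1
    return n

def to_code(num):
    chars = []
    while num > 0:
        num, d = divmod(num - 1, len(ALPHABET))
        chars.append(ALPHABET[d])
    return ''.join(reversed(chars))

def increment(code):
    return to_code(from_code(code) + 1)
```

With this in place, codes containing digits increment naturally, e.g. `increment('az')` yields `'a0'` under the assumed ordering.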

Don’t do anything against their terms of service, Aidan.

Scraping Canvas (LMS)

Because my time at university is ending, I thought it best to archive the Canvas pages available to me for later reference, in case I can't access Canvas in future because they change platforms or disable my account. I should probably add that this is for archival purposes, and I will not be able to share the data I collected. Thankfully I got the whole thing going in a few minutes; the downloading took a lot longer.

The first snippet I got from here didn't complete the first time; it seemed some image was causing issues, so I moved to another gist. At this rate we could be done in half an hour 😊.

Unfortunately it also borked out in a similar place:

FileNotFoundError

I think it's because something is missing or I don't have access to it. But the real problem is that it was downloading content for a course I didn't care about: I was enrolled in it, but it's full of junk I'm not interested in. We can skip it by using the second scraper's code and specifying the course IDs, which I had to go through manually; there were about 15 of them, but it didn't take too long. That gave me the full command.

F:\Downloads\canvas>python canvas.py https://canvas.hull.ac.uk/ 4738~DUI9Nha9weSuemu1M2qsmhljoBcQtR0zghXTs3QA7ECHDHQkpsgBQ9RllbaEwySf output 52497,56148,56149,52493,54499,54452,54456,53441,52496,22257,22274,22276,22277,22278,22279,22280,50664,50656,22275,50652

The access token you can see above should be expired by now. You can do this yourself by downloading the same file and installing Python 3, pathvalidate and pycanvas. You need to generate a security token from /profile/settings, and you can get each course ID by clicking on the course, like this: /courses/56149. When you generate a new token you should receive an email about it.

Canvas online with our starred modules displayed.

I decided to make a small adaptation to catch the FileNotFoundError and was off to the races. It took over an hour, so I decided it was best to leave it running overnight. When I returned in the morning I had 116 errors (failed downloads), and the rest is the course content!

Our Canvas Modules saved to Windows File Explorer.

Unfortunately I don't seem to have the submissions for each of these courses, so I needed to download them manually as well, and then our archive was complete.

Thanks for reading.

Exporting GnuCash Data to PowerBi

Some things are better if you do them yourself. I mainly did this project to keep a running ledger of my transactions so I could track changes in my account balance over time.

GnuCash is great, but when I export my accounts data, the CSV file isn't easily translated by Power Query automatically. Because I need to keep track of my transactions, this problem was best suited to a little Python program that calculates my net inflows and outflows. Halfway through the project I also decided I wanted to hook it up to my graphing backend, a silly idea but a fun one, watching my spending go up and down. I decided to publish it here so that in future I'd find it a lot easier.

Traditionally, if you wanted to get GnuCash data from CSV into PowerBi, you'd be better off using its built-in Power Query. However, I wanted to implement a ledger system, something I don't think can be accomplished directly in PowerBi without some scripting. I also want to be able to change platforms in future, should I no longer have a PowerBi licence or want to use something else, like Excel or free equivalents such as Google Sheets. I reckon that, done properly, Google Apps Script could make everything run automagically from an upload, but my reports aren't ready for that yet.

# Extracted from https://github.com/aidancrane/GnuCash-CSV2CSV-for-PowerBi/blob/master/convert_to_powerBI.py
import csv
from re import sub
from decimal import Decimal
import time

# If we only want the transaction values, set this to True and they will
# print to the console; they will not be saved to accounts_out.csv!
numbersOnly = False
# Anyone that's not me will want this to be False; it is used to show the
# transaction data on a live graph I use for scratching around with.
smoothieChartEmulation = False
sessionCookie = "39494goodluckguessingthispartlololol213232expiresanyway"

# Leave this here so that Notepad++ and Atom auto-suggest it.
# Date,Account Name,Number,Description,Notes,Memo,Category,Type,Action,Reconcile,To With Sym,From With Sym,To Num.,From Num.,To Rate/Price,From Rate/Price

if smoothieChartEmulation:
    import requests

# Negative numbers are bracketed when exported from GnuCash, so we need to
# fix that for the float data type.
def convert_if_negative(number):
    returnNumber = str(number)
    if "," in returnNumber:
        if "(" in returnNumber:
            returnNumber = returnNumber[1:]
            returnNumber = returnNumber[:-1]
            returnNumber = returnNumber.replace(",", "")
            returnNumber = 0 - round(float(returnNumber), 2)
        returnNumber = str(returnNumber).replace(",", "")
    if "(" in returnNumber:
        returnNumber = Decimal(sub(r'[^\d.]', '', returnNumber))
        return (0 - round(float(returnNumber), 2))
    return returnNumber

# Open accounts.csv, our exported file.
with open("accounts.csv", "r") as csvIn:
    reader = csv.DictReader(csvIn)
    entries = []
    runningTotal = float(0)
    for row in reader:
        if row["Account Name"] == "":
            pass
        else:
            runningTotal = runningTotal + float(convert_if_negative(row["To Num."]))
            if numbersOnly:
                print(str(round(runningTotal, 2)))
            else:
                if smoothieChartEmulation:
                    payload = {'random_graph': runningTotal}
                    r = requests.get('https://dash.infinityflame.co.uk/dash/flex.php', params=payload, cookies={'PHPSESSID': sessionCookie})
                print(str(convert_if_negative(row["To Num."])) + " Description: " + row["Description"] + " Account Balance: " + str(round(runningTotal, 2)))
                entries.append([row["Date"], row["Description"], row["Category"], str(convert_if_negative(row["To Num."])), str(round(runningTotal, 2))])

# Save what we care about to our new CSV for PowerBi.
with open('accounts_out.csv', mode='w', newline='') as csvOut:
    titles = ["Date", "Description", "Destination", "Transaction", "Account Balance"]
    writer = csv.writer(csvOut, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    writer.writerow(titles)
    for transaction in entries:
        writer.writerow(transaction)

I used Python because it's quick to debug in Atom and I know it well. To use this snippet, export an account from GnuCash, then run the program to produce the final CSV for PowerBi to take in; I've chained these steps together. The final result is stored in accounts_out.csv.

Thanks!

Simulating a Phone Combination Brute Force

If a malicious individual were to steal your Android or iPhone, plug in a device that emulates a keyboard, and have it test every possible passcode, it would take a while. Using the following tutorial, you can calculate the time it would take to do so.

First, you need to grab Python 3.4.3, or you can probably use the version you already have installed. Next, we need to create the code.

Firstly, we need to import datetime to convert the number of guesses into the time it would have taken. We also need to write down what the combination is; for this example, it will be '3502'.

import datetime
combination = "3502"

print (" [Info] Starting")

Then we need to track the current guess and how much time has passed. As it takes time to enter the numbers into the device, we will simulate that as well, at one second per guess.

guess = "0000"
seconds_taken = 0

def addsec(seconds):
     global seconds_taken
     seconds_taken = seconds_taken + seconds

I could have added the seconds logic into the code directly, but writing it as a def lets me edit it later if I need to. Now that we have done the basics, we need to start guessing. There are 10,000 possible combinations, including ones such as 0001; this is problematic because leading zeros are not preserved in Python integers. We can fix this with .zfill(4), which pads the leading zeros back onto the guess, allowing us to compare it with the actual combination while still converting the guess back to an integer to check whether we have exceeded our limit. We also add one second per guess.

while int(guess) <= 9999:
     addsec(1)
     if guess.zfill(4) == combination:
          print (" [Alert] Combination guessed, combination is " + combination)
          break
     else:
          guess = str(int(guess) + 1)
          print (" [Info] Guess is now '" + str(guess).zfill(4) + "'")

Finally, we need to convert our result into a time. We divide seconds_taken (which is coincidentally the number of guesses, if you count '0000') by 5 (because a penalty hits after every 5 guesses), truncate that to an integer (rounding down), and multiply by 300 to simulate a 5-minute lockout. Then we add penalties_incurred and seconds_taken to get the total time to guess the combination in seconds, and use datetime to convert that into an hh:mm:ss format.

penalties_incurred = int(seconds_taken / 5) * 300
time_taken = (str(datetime.timedelta(seconds=(penalties_incurred + seconds_taken))))
print (" [Finished] The combination would have taken '" + time_taken + "' to brute force. (h:m:s)")
print (" [Finished] You would have had to wait for " + str(int(penalties_incurred / 300)) + " lockout session(s)" )
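The arithmetic above collapses to a one-line formula, which is handy for sanity-checking the figures that follow. This is my own condensation of the script's penalty model (one second per guess plus a 300-second lockout after every five guesses):

```python
import datetime

def brute_force_seconds(guesses, penalty_every=5, penalty_seconds=300):
    """Total time: one second per guess, plus a lockout penalty
    after every `penalty_every` guesses."""
    penalties_incurred = (guesses // penalty_every) * penalty_seconds
    return guesses + penalties_incurred

# All 10,000 four-digit combinations under the Android model:
print(datetime.timedelta(seconds=brute_force_seconds(10000)))  # 7 days, 1:26:40
```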

What have we learnt?

  • There are 10,000 possible combinations.
  • For my combination, it would take 6 days, 30 minutes to guess.

On an Android Device,

  • It would take over 2,000 lockouts to guess every combination.
  • It would take 7 days, 1 hour, 26 minutes and 40 seconds to guess every combination.
  • It would take 8 hours, 28 minutes and 20 seconds to guess 500 combinations.
  • It would take 50 minutes and 50 seconds to guess 50 combinations, with 10 lockouts.

On an Apple Device*,

  • It would take 1666 lockouts to guess every combination.
  • It would take 5 days, 21 hours, 36 minutes and 40 seconds to guess every combination.
  • It would take 7 hours, 3 minutes and 20 seconds to guess 500 combinations.
  • It would take 40 minutes and 50 seconds to guess 50 combinations, with 8 lockouts.

*However, Apple devices can wipe themselves after 11 bad combinations. To avoid this, the combination would have to be entered correctly after the sixth try for the Apple estimates to hold, which defeats the purpose of brute forcing. For that reason Apple devices are much more secure, although there is potential for data to be deleted accidentally.

This simulation is flawed because,

  • It does not take into account combinations greater than 4 digits
  • It does not take into account cumulative waiting times
  • It does not take into account device combinations that don’t involve numbers
  • You could increase the number of digits allowed in order to calculate your own combination; for example, if it was 67890, replacing the 9999 in the while loop with 99999 would allow you to calculate it.

Here is the full code extract,
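(Reassembled from the snippets above into one runnable script; nothing new is added.)

```python
import datetime

combination = "3502"

print(" [Info] Starting")

guess = "0000"
seconds_taken = 0

def addsec(seconds):
    global seconds_taken
    seconds_taken = seconds_taken + seconds

while int(guess) <= 9999:
    addsec(1)
    if guess.zfill(4) == combination:
        print(" [Alert] Combination guessed, combination is " + combination)
        break
    else:
        guess = str(int(guess) + 1)
        print(" [Info] Guess is now '" + str(guess).zfill(4) + "'")

penalties_incurred = int(seconds_taken / 5) * 300
time_taken = str(datetime.timedelta(seconds=(penalties_incurred + seconds_taken)))
print(" [Finished] The combination would have taken '" + time_taken + "' to brute force. (h:m:s)")
print(" [Finished] You would have had to wait for " + str(int(penalties_incurred / 300)) + " lockout session(s)")
```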

Using Hashlib to Securely Store User Passwords and Credentials

What is hashing?

Hashing a password means users' passwords are not compromised when someone, such as a database engineer, reads the user database: instead of cleartext they see an illegible digest. (To a degree, weak hashes can still be cracked, but hashing makes passwords illegible to anyone not making an extensive effort.) It also prevents hackers who obtain the database from reading passwords in plain text, although some hash functions can be weakened by collision attacks.

Additionally, when hashing a password a salt may be added to it; this protects the database from dictionary attacks.

Why Hash Passwords?

Storing user credentials in plain text is generally bad practice, as it allows anyone who can read the file (or the computer) to see the password, username or any other credential without any sort of protection. In some cases it also violates industry rules, such as the PCI SSC's Data Security Standard, which covers debit and other card types. The solution is obfuscation in the form of hashing: a hashed password looks completely random.

How hashing works

When a user signs up for a website or any other service that requires secure credentials, such as a password, username, email address or postal address, they fill in a form asking for those credentials. The web server then hashes the password, stores the hash, and 'throws away' the original. In a more secure environment a salt is also used; it may be unique to the user or unique to the application. (The user never sees the salt; it is held by the server.) When hashing, the password and salt are combined and hashed together.
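A minimal sketch of that scheme using Python's standard hashlib, with a random per-user salt and PBKDF2 as the hash (the function names and iteration count here are my own illustrative choices, not a definitive implementation):

```python
import hashlib
import hmac
import os

def hash_password(password, iterations=100_000):
    """Hash a password with a fresh random salt using PBKDF2-HMAC-SHA256.
    The server stores (salt, digest); the original password is thrown away."""
    salt = os.urandom(16)
    digest = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, iterations)
    return salt, digest

def verify_password(password, salt, stored_digest, iterations=100_000):
    """Re-hash the candidate password with the stored salt and compare
    in constant time."""
    candidate = hashlib.pbkdf2_hmac('sha256', password.encode(), salt, iterations)
    return hmac.compare_digest(candidate, stored_digest)

salt, digest = hash_password("correct horse battery staple")
print(verify_password("correct horse battery staple", salt, digest))  # True
print(verify_password("guess", salt, digest))                         # False
```

Because each user gets their own salt, two users with the same password end up with different digests, which is what defeats precomputed dictionary attacks.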