Having recently been tasked with finding out which file hoster is the best available on the Internet, I have chosen to approach the task in a slightly different way.
First, what is the “best” file hoster anyway? Price is unlikely to be the deciding factor, since most companies keep their prices similar and competitive. What about speed? Speed matters, but it is not the whole story: a hoster might offer one of the fastest connections and yet attract very few uploaders. So speed is perhaps better treated as a secondary attribute to popularity.
But how do you measure popularity? You can’t rely on figures from the file hosters themselves, since each will likely claim to be the largest and most popular. So where can such data be found?
Where can we search for how popular a website is?
There are a few ways of doing this, but the most useful is to scrape information about which hosters file uploaders are currently using. For instance:
Some websites tag post titles to identify which file hosters are used in a particular post, e.g. [MS][FS].. or [MS|fs|… However, this tagging is not enforced, nor is there a universal format for displaying it.
Counting those title tags is the first way of looking for popularity; the more accurate way is to automatically follow these posts and look for the actual links found within each post.
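To make the title-counting idea concrete, here is a minimal Perl sketch, assuming one title per input line; the tag formats it handles ([XX][YY]-style brackets with “|” or “,” separators) are my own guesses at common conventions, not any standard:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Tally hoster tags found in post titles (one title per input line),
# e.g. "[MS][FS]" or "[MS|FS]". The formats handled here are
# assumptions; real posts will vary.
my %count;
while (my $title = <STDIN>) {
    # Grab every bracketed group in the title.
    while ($title =~ /\[([^\]]+)\]/g) {
        # A bracketed group may list several hosters split by '|' or ','.
        for my $tag (split /[|,]/, $1) {
            $tag =~ s/^\s+|\s+$//g;          # trim whitespace
            $count{ lc $tag }++ if length $tag;
        }
    }
}
# Print counts as CSV, most popular first.
printf "%s,%d\n", $_, $count{$_}
    for sort { $count{$b} <=> $count{$a} } keys %count;
```

Note that a [Multi] tag would simply be tallied as its own entry here, which is one of the issues covered below.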
There are a few issues that need to be accounted for (a short sketch addressing the link-counting side follows this list):
Not everyone uses the same title naming scheme.
Links within posts should only be counted once.
Some titles just use [Multi] to describe many file hosters.
Links can use obscuring services (e.g. safelinking.net) to hide the actual link / hoster.
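The second issue, and the [Multi]/naming mess, can be sidestepped by working on the links themselves rather than the titles. Here is a minimal Perl sketch that keys on hostnames and counts each one only once per post; the URL pattern and the decision to fold on hostname are assumptions for illustration:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Extract the distinct hostnames linked from a single post, so each
# hoster is counted at most once per post.
my %seen;
my $post = do { local $/; <STDIN> };        # slurp the whole post
while ($post =~ m{https?://([^/\s"'<>]+)}g) {
    my $host = lc $1;
    $host =~ s/^www\.//;                     # fold www. variants together
    $seen{$host} = 1;                        # dedupe within this post
}
print "$_\n" for sort keys %seen;
```

Obscured links would still surface here as the obscuring service’s own hostname (e.g. safelinking.net); resolving those to the real hoster would need a separate step that this sketch does not attempt.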
I don’t want to spend an age building software to answer this question, so scripts and pre-built tools are key. The plan, then (a rough driver sketch follows the list):
Wget to download the pages and posts.
One Perl script to count occurrences of title information.
One Perl script to look for links in posts.
Output everything as CSV and timestamp it.
Run over a period of days, collecting data.
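Putting those steps together, a driver might look like the sketch below; the forum URL and every file and script name are placeholders, not references to anything real:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use POSIX qw(strftime);

# Driver sketch tying the steps together: fetch a forum page with
# wget, run the two counting scripts over it, and append the results
# to time-stamped CSV files.
my $url   = 'http://forum.example.com/board.php';   # hypothetical
my $stamp = strftime('%Y-%m-%d_%H%M', localtime);
my $page  = "page_$stamp.html";

system('wget', '-q', '-O', $page, $url) == 0
    or die "wget failed: $?\n";

# count_titles.pl and count_links.pl are the two scripts sketched above.
system("perl count_titles.pl < $page >> titles_$stamp.csv") == 0
    or warn "title tally failed: $?\n";
system("perl count_links.pl  < $page >> links_$stamp.csv") == 0
    or warn "link tally failed: $?\n";
```

A cron entry (or a simple loop with a sleep) could then rerun this every few hours, giving the time-stamped CSV series the last two steps call for.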