Create Your Own Cloud

Our team works in many parts of the globe, and we use many operating systems on both our computers and handheld devices. We needed a simpler way to weed out duplicate files and a better version control system. In addition, our team members have to access a common area to upload their photographs so that the designers and editors can work on them without individually signing up for multiple storage options.

What we need is a robust cloud system without the hassle of connecting to our central NAS (Network Attached Storage) file server through a VPN (Virtual Private Network). In other words, we need a system easy enough for the novice user whose specialty is not computer science. We had set up a considerable amount of storage space on our file server, and the only way to access this space was through a VPN. But setting up a VPN on every device for a newcomer is a pain – both for the end user and for the technical help.

We explored cloud options, both paid and free. Here is a summary of what we found at the time of writing this blog.

Service      | Free storage                       | Paid storage                                  | Supported OS
iCloud       | 5 GB                               | 50 GB at $0.99/month; 200 GB at $2.99/month   | Mac, iOS, Windows, Android
OneDrive     | 15 GB                              | 100 GB at $1.99/month; 200 GB at $3.99/month  | Windows, Android, Mac, iOS
GoogleDrive  | 15 GB                              | 100 GB at $1.99/month; 1 TB at $9.99/month    | Android, Windows, Mac, iOS
DropBox      | 2 GB                               | 1 TB at $10/month                             | Windows, Mac, iOS, Android
OwnCloud     | Limited only by your own hardware  | –                                             | Windows, Mac, iOS, Android

There are many reasons why you need to consider private cloud options:

  1. Scalability: A simple NAS server with multiple multi-terabyte hard drives configured in RAID can easily provide you with humongous storage space. With the increase in the number of pixels in digital cameras and smartphones, you can easily notch up several GB of photos in a given year. With your own private cloud system you don’t need to worry about exceeding your quota.
  2. Privacy: To me, this reason alone is enough to set up our own cloud space to store our files. Some of our clients have insisted on a non-disclosure agreement which explicitly prohibits storing private files on any public service. If you opt for your own private cloud/file sharing system, you are responsible for the security and you can define a complete security policy from the ground up.
  3. Lower Cost: In the long run, your own cloud space will cost less than any public cloud space in terms of cost per GB per year. Take our own example: we set up 5 TB of cloud space and have not spent a dime on any new hardware or software. We simply configured our existing NAS server to run a cloud service.
  4. Control: You get to decide how long a photo remains ‘alive’. Many public cloud services allow you to retain the original photos only up to certain time limits. With your own cloud, the life expectancy of your precious images and files is not under any cloud owing to the service you have opted for. You have complete control over your resources. For me, this seals the deal.

 

How-to:

 

Before we start, we make a couple of assumptions: that you have a Linux box with Internet connectivity and that you are comfortable with the CLI.

The folks who wrote the ownCloud documentation are quite helpful, and they have written quite a lot. They provide extensive documents in PDF form for users and administrators, among others. If you dig into the owncloud directory created at the time of installation, you will notice that there is a ‘README’ file as well.

You would be well advised to wade through these documents for a complete grasp of this cloud.

But I can assure you of one thing: sometimes even the best of documentation can leave you puzzled for hours over the simplest of issues. Trust me – I started my career as a technical writer and, in my time, I have written volumes on technical subjects. The first question we ask is ‘Who is the target audience for this document?’

The rest of the manual falls into place once we identify the target. How you write a user manual is completely different from how you write a service manual. The tone and the language will be vastly different from each other.

Still, you will find a common underpinning in both versions – verbiage. It is probably the bane of technical writers. In a valiant effort to make our work as easy to understand as possible, we probably end up using too many words in our narration. The Shakespearean nugget of wisdom – ‘Brevity is the soul of wit’ – is found wanting in most documentation.

This page is an attempt to chip away at the excess verbiage and present a simple, direct ‘how-to’ for setting up your ownCloud server on your Linux box and setting up any device to access the cloud.

You will find that I have set up access to ownCloud from Windows, an Apple Mac mini, an iPhone and an iPad.

According to the official document –

“For best performance, stability, support, and full functionality we recommend:

  • Red Hat Enterprise Linux 7
  • MySQL/MariaDB
  • PHP 5.4 +
  • Apache 2.4”

While you are at it, make sure that your Linux installation is up to date.

There are some dependencies your Linux installation needs before you can proceed with the rest of the installation: mod_php, php-mysql, php-json, php-gd, php-mb_multibyte, php-ctype and php-zip.

ownCloud is written in PHP and needs a recent PHP version. But you need not worry about the dependencies or the version – the package installer will check for dependencies and install any additional requirements.
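If you would like to check what is already on the box before you run the installer, a quick sketch like this can help (assuming PHP is already installed; package names vary slightly between distributions):

[code]
# Show the installed PHP version
php -v

# List the loaded PHP modules and look for the ones ownCloud needs
php -m | grep -Ei 'json|gd|mbstring|ctype|zip|mysql'
[/code]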

Setting up ownCloud on the Linux box: We assume that you are on CentOS Linux. Select the option for your distribution from here (https://download.owncloud.org/download/repositories/stable/owncloud/).

Select your Linux OS

  1. Before you install the cloud, you need to import the repository key as root from the command line:

[code]
rpm --import https://download.owncloud.org/download/repositories/8.2/CentOS_6/repodata/repomd.xml.key
[/code]
At the time of writing, the owncloud version is 8.2 and so I issued the following:
[code]
wget http://download.owncloud.org/download/repositories/8.2/CentOS_6/ce:8.2.repo -O /etc/yum.repos.d/ce:8.2.repo

yum clean expire-cache

yum install owncloud
[/code]
Installation in CentOS Linux

Post Installation Screen
Once the files are installed you will need to configure the server. You will find a directory called owncloud in the document root. Start by setting the right permissions like so:

chown -R apache:apache owncloud

And point your browser to the server like this: https://your-server-ip-address/owncloud

You will be presented with this screen:
Start screen
If all is well, you can continue by adding your IP address as a ‘trusted domain’. Then it will ask you to set your user name and password.
All questions and directives are easy enough to follow without breaking into a sweat.
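Should you need to adjust the trusted domains later – say, after the server’s IP address changes – they live in ownCloud’s config/config.php inside the owncloud directory. A minimal sketch of the relevant entry (the path and the address shown are assumptions for illustration):

[code]
# config/config.php inside the owncloud directory (path assumed)
'trusted_domains' => array (
    0 => 'localhost',
    1 => 'your-server-ip-address',
),
[/code]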

Administrator Set up Page
With that you are done and your cloud is up and ready. But you need to create access to this spanking new cloud for all your users’ devices.

The easiest way is to connect through a browser. Point your browser at the IP address through https.

This is the access from Firefox browser in Windows:
Access through Firefox Browser in Windows
And this screen is from the Windows App provided by OwnCloud:

Windows Desktop Owncloud App

This is from Apple Mac Mini’s App:

OwnCloud has Mac OS App
This screen is from an iPad:

Apple iPad connecting to OwnCloud
You can set up additional users within the same group and they can share cloud space. Supply the credentials and you will be logged into the cloud.

By default, you have two directories: Documents and Photos. Very convenient, eh?

You can upload your photos and documents from your portable devices to the cloud. Here is an example of someone getting ready to upload a set of images from an iPad:

Uploading images to ownCloud from iPad
That brings us to the end of this page. I will leave you to explore more of your OwnCloud at your own pace.

How to Identify Bad Bots

Every Webmaster knows that a considerable part of a site's traffic comes from search engine robots, which in most cases benefits the site. All mainstream search engine robots (bots, as they are called in our circles) obey the directives found in the robots.txt protocol and exhibit normal crawling behavior. But there are bad bots which not only consume considerable bandwidth without offering any tangible benefit to the Webmaster, but also siphon off content to be used elsewhere, or even cause spurts in resource requests that slow down the server.

It is always an onerous task to keep these malevolent bots in check. As a system administrator for dozens of production servers over the last 15 years, I can tell you one thing – keeping the bad people, and their bad bots, away from their targets takes up a sizable amount of time.

There are many ways to handle this, and every Webmaster I have come across handles the situation in her/his own way. As long as you stay ahead of them, whatever technique/method you use is probably working – and at the end of the day, that is all that matters.

We often come across queries in online communities about the best technique to manage this overwhelming task. This how-to on identifying the bad bots is intended to help the average Webmaster get a handle on it.

Bad bots generally disobey the directives in robots.txt, or they read it only to look for prohibited directories to crawl. Sometimes it need not even be a fully grown robot that sucks up your bandwidth – it can be a home-grown ‘grabber’ unleashed using utilities like wget or curl.

Not all bots are bad. Apart from the search engines, some robots are operated by other legitimate agencies – to determine the best matching campaign for a page’s content for a potential advertiser, to look for linking information, or to take a snapshot for archiving purposes. We have trawled through our server log files to compile a list of robots you are likely to find in your own server log here: Bot List.

Tip: You can scroll through the list at your leisure or, if you are impatient like me, start typing the name of the robot in the form and it will show you the probable match.

The list contains bots with identifiable information given in their User-Agent field. When you browse through the list, you will also find that many major search engines switch User-Agent strings as per their need.

For example, Google may use any of the following:

Mozilla/5.0  Googlebot/2.1;

Mediapartners-Google

Googlebot-Image/1.0

DoCoMo/2.0 N905i(c100;TB;W24H16) Googlebot-Mobile/2.1;

 

or anything that suits its fancy.

This list shows bots, most of which obey the robots.txt protocol. If you block one of them in your robots.txt and it still shows up, don’t break into a sweat: it may have cached your robots.txt and may be working from an older directive. Wait for 24 hours before you decide on any punitive step.
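For instance, to ask a particular bot to stay away from the whole site, a robots.txt entry like this is all it takes (the bot name below is just a placeholder):

User-agent: SomeBot
Disallow: /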

 

Identifying the Bad Bots

 

Start with the server log files. Download the most recent or last month’s server log file from the server for analysis. Alternatively, you can look at a snapshot of the last 300 entries from cPanel: Latest Visitors → your domain name → View → scroll down and hit ‘Click Here for the legacy version’. It will show you a list of the last 300 requests.
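If you prefer the command line, you can get a quick overview straight from the raw access log. A rough sketch, assuming the common Apache ‘combined’ log format (adjust the file name to match your server):

# Top requesting IP addresses
awk '{print $1}' access_log | sort | uniq -c | sort -rn | head -20

# Top User-Agent strings (field 6 when the combined format is split on double quotes)
awk -F'"' '{print $6}' access_log | sort | uniq -c | sort -rn | head -20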

 

The following is part of such a snapshot. It shows the bots from Yahoo and Bing crawling some pages of the TargetWoman site. As you can see, the regular search engines request a copy of robots.txt before they crawl the site. They cache the robots.txt file for a while so that they don’t need to keep requesting it over and over again.

Server Stats

One of the techniques to isolate bad bots is to set up a ‘spider trap’: add a hidden link in the body of your main page. It is hidden from human visitors but can be reached by any crawler as it parses the page. The link is blocked in robots.txt, so well-behaved bots will avoid it. But the bad bots will go ahead and crawl it anyway.

To give you an example:
Block bots.cgi in the robots.txt like so:

User-agent: *
Disallow: /cgi-bin/bots.cgi
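The hidden link itself can be as simple as the following – invisible to human visitors, but sitting in the HTML for any crawler that parses the page (the style used to hide it is just one possible choice):

<a href="/cgi-bin/bots.cgi" style="display:none">&nbsp;</a>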

The following is a sample Perl script to demonstrate how a simple trap script can write all the bad bots to a log file – with a few details:

#!/usr/bin/perl

# Script to add bad robots which disobeyed robots.txt to a list
# Script by TargetWoman.com

use strict;
use warnings;

# Date formatting: localtime gives the month as 0-11 and the weekday as 0-6 (0 = Sunday)
my ($sec, $min, $hour, $mday, $mon, $year, $wday) = (localtime(time))[0..6];
$year += 1900;

my @months   = qw(January February March April May June July August September October November December);
my @weekdays = qw(Sun Mon Tue Wed Thu Fri Sat);
my $mono     = $months[$mon];
my $weekday  = $weekdays[$wday];

# Print the date the way we want: 1st March, 2014 etc.
my $dayoo;
if    ($mday == 1) { $dayoo = "1st"; }
elsif ($mday == 2) { $dayoo = "2nd"; }
elsif ($mday == 3) { $dayoo = "3rd"; }
else               { $dayoo = "${mday}th"; }

my $timo = sprintf("%02d:%02d: %s, %s, %s, %d", $hour, $min, $weekday, $dayoo, $mono, $year);

# Collecting details about the pesky bot
my $ua  = $ENV{'HTTP_USER_AGENT'} || '';
my $ip  = $ENV{'REMOTE_ADDR'}     || '';
my $ref = $ENV{'HTTP_REFERER'}    || '';

# Time for action: append the details to the bad-bots log
my $file = "temp/bad-bots";
open( LOG, ">>", $file ) or die "Cannot open $file: $!";
print LOG "$ua|$ip|$ref|$timo\n";
close LOG;

# Return an empty page so the bot gets nothing useful
print "Status: 200 OK\n";
print "Content-type: text/html\n\n";
exit;

Blocking Bad Bots

Check this bad-bots file later for further remedial action.

There are many ways to deny access to these unwelcome bots.

Option 1:

You can check the IP addresses against a white list (add your own IP address as well as those of the major search engines to this white list) and block the remaining IP addresses in the firewall.
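Once you have confirmed an IP address that is not on the white list, blocking it at the firewall is a one-liner. A sketch using iptables (the IP address below is only a placeholder):

# Drop all traffic from a confirmed bad IP address
iptables -A INPUT -s 203.0.113.45 -j DROP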

 

Or assign the User-Agent string to a deny list, which results in a 403 status (Forbidden). This uses fewer server resources.

 

Many Content Management Systems (CMS) use a technique where a request for a non-existent page triggers their engine to parse the request and deliver content pulled from a database. Dynamic pages inherently use more server resources than static pages unless they are cached. If the CMS can be made to check the User-Agent string and send a 403 page instead of ‘preparing’ the content, it can save quite a bit of resources.

 

For example, one of our sites uses a CGI script in our CMS. The following snippet of code sends a 403 Forbidden status to the User-Agents wget and curl:

 

if ($ENV{'HTTP_USER_AGENT'} =~ /wget|curl/i) {
	print "Status: 403 Forbidden\n";
	print "Content-type: text/html\n\n";
	exit;
}

 

Option 2:

You can use .htaccess to block the bad bots, assuming that you use the Apache HTTP server. If you have a few bad bots which regularly use a particular User-Agent string, it is easy to block them based on that string.

 

SetEnvIfNoCase User-Agent "^Wget" bad_user
SetEnvIfNoCase User-Agent "^Riddler" bad_user

# Apache 2.2 style access control
Order Allow,Deny
Allow from all
Deny from env=bad_user

But it may not work all the time. Most bad bots change their User-Agent strings often or mimic a normal browser string. That said, few handlers of these bots keep up with the versions of the browsers they try to simulate – we have seen claims of IE version 5 or Firefox version 3.x too.

 

As I said earlier, warding off bad bots is an ongoing process. We have used different techniques on different occasions, and we share a few here. If you have something better, we would like to hear about it – please share it in the comments.

 

Thank you for being with us all the way here!

Reduce Page Load Time

Search engine giant Google has recently indicated that it may consider the speed of your website as one of its ranking factors in the future. As a Webmaster, you have known this all along – a faster-loading page adds to the positive feel. Ironically, most designers create elaborate, visually appealing eye candy to enhance user appeal without realizing that the extra loading time actually detracts from that positive note.

This blog will take a look at the techniques and best practices recommended by Google through its Page Speed http://code.google.com/speed/ and Yahoo’s Best Practices for Speeding Up Your Website – http://developer.yahoo.com/performance/rules.html

If you check out these recommendations, you will realize that some of the best practices are primarily intended for corporate websites. As usual, we try to distill the collective wisdom into a condensed format useful for the average Joe Webmaster. The tips you will find here can be used to speed up the access time of your pages without breaking the bank or hiring a rocket scientist to recode.

Here is a summary of how to optimize your pages and reduce the page load time:

1. Reduce file size:

1a. Start with your images – optimize them:

Did you know that even TargetWoman – which is primarily a content-rich site – still sees a little over 54% of its bandwidth consumption go to image traffic?

The approach is simple: the larger your file is – whether it is a Flash file, an eye-candy high-resolution image or an HTML/CSS/Javascript file – the longer it takes for the visitor’s browser to fetch it. If it is an image, ask your designer to selectively manipulate or compress the image. A carefully optimized image can often come down to half its former size.

For static content, optimize browser/proxy caching by setting a longer expiration time. If you are on an Apache server, the following snippet in your .htaccess can help cache your image files:


<FilesMatch "\.(jpg|jpeg|png|gif|swf)$">
Header set Cache-Control "max-age=2592000, public"
</FilesMatch>

The max-age figure allows caching for up to 30 days (2,592,000 seconds). If you check your server’s header information, it will return something like the following (example):


HTTP/1.1 200 OK
Date: Mon, 30 Nov 2009 11:17:00 GMT
Server: Apache/2.2.14 (Unix)
Last-Modified: Wed, 25 Nov 2009 12:44:26 GMT
ETag: "1a5a0f1-67ec-479316b377e80"
Accept-Ranges: bytes
Content-Length: 26604
Cache-Control: max-age=2592000, public
Connection: close
Content-Type: image/jpeg

Declare your image’s width and height value in HTML to help the browser to render the image faster.

1b: Optimizing Code:

When it comes to CSS optimization, avoid CSS expressions and combine many files into one. Remove redundant directives and minify the file. In the same way, combine several Javascript files into one external file and minify it. Make references to CSS and Javascript in the header of the HTML file. If you use Google Page Speed, you can use the Closure Compiler, which can minify Javascript – available from here. Alternatively, you can use the YUI Compressor from here to minify as well.
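Both tools run from the command line. A sketch of typical invocations (the jar file names and versions will vary with what you download):

# Minify Javascript with the Closure Compiler
java -jar compiler.jar --js carousal.js --js_output_file carousal.min.js

# Minify CSS (or Javascript) with the YUI Compressor
java -jar yuicompressor-2.4.8.jar style.css -o style.min.css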

Another feature of Google Page Speed is its ability to compress images (if you point it to a page with images) and save them to a configurable local directory. You might find this a time saver. This alone is a reason to download Page Speed for your Firefox browser. Head to this link for using Page Speed.

You will need the Firebug add-on for your Firefox browser before you can use Page Speed.

If your text content (Javascript and HTML) runs to more than a few tens of kilobytes, you should consider gzipping these components. It saves on bandwidth costs as well as on loading time.
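On Apache, mod_deflate can handle the compression on the fly. A minimal sketch for your .htaccess or server config, assuming mod_deflate is enabled:

<IfModule mod_deflate.c>
# Compress text-based responses before sending them out
AddOutputFilterByType DEFLATE text/html text/plain text/css application/javascript
</IfModule>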

2. Server Side Tweaks to reduce the page load time:

Reduce HTTP requests by combining several components into one. We mentioned in a previous paragraph how you should combine many Javascript files into one. We also offered a tip to use a longer expiry time for cacheable images on Apache servers.

Avoid redirections, as they add extra round trips between the browser and the server.

Weed out 404 errors from your pages. If you use a clever 404 error-trapping script to deliver an intuitive alternative to the visitor, as this site does, that is all the more reason to be wary of 404 errors. A careless reference to a non-existent image can fire up the engine which services the error-trapping call while the browser waits for the remaining components.
Reduce Page Load time

Reduce DNS lookups – every time a browser encounters a new domain name, it can take anywhere from 20 to 120 milliseconds to look up the IP address. This latency can be minimized if you use the IP address directly. Split components across domains to enhance parallel downloads.

Cookies: Serve static content from a cookie-less domain. Domains which serve cookies add back-and-forth traffic that is not required for serving static content. While you are at it, reduce the size of the cookies set by your server.

Use consistency in referencing resources: it is quite a mouthful just to say it, but you can save quite a bit of time by being consistent in your references. For example, if you have a carousel which needs a longish Javascript file, reference it the same way – www.domain.com/carousal.js – from all pages residing anywhere in your domain or subdomains. Absolute URLs to a consistent reference point shave off time perceptibly, because the browser can reuse its cached copy instead of fetching what it sees as a different resource.
