In IT, errors and mistakes are always measured by the impact they cause: the downtime, the outage, or anything else that affects the business. Sometimes a very small mistake can bring down an entire application, and that can lead to a million-dollar loss.

So here I share some best practices for admins.

There is a big difference between an application admin and a system admin making a mess of things. When an application admin messes something up, it affects only the particular application he/she is responsible for. When a system admin messes something up, it affects every application hosted on that box/server.

So here are the messy things. Let’s start the count at 0, binary style 😉

  • Messy 0:

We all know the touch command creates a new file. There is a big difference between ‘vi’ and ‘touch’, though. vi opens the file if it already exists and creates it if it does not. ‘touch’, on the other hand, does not always create a new file: if the file exists it only updates the timestamp, and if the file does not exist it creates it. Many times “touch” is used to check whether you have write permission on a directory. This may look very simple until the application cannot start [for example, an MQ Queue Manager cannot start without its logs] and the Backup Team needs 2 hours to restore the original log file. We are not saying that you don’t know the command; the point is the impact the command causes.

Lessons Learned:

  • Don’t do anything in a hurry.
  • Know the impact of every command you type.
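
If the goal is only to check write permission on a directory, one careful way is to create a throwaway file with a unique name, so that no existing file (such as an application log) is ever opened or changed. Here is a minimal Python sketch of that idea. The function name `can_write_to_directory` and the `.write_test_` prefix are made up for illustration.

```python
import os
import tempfile


def can_write_to_directory(directory: str) -> bool:
    """Return True if a new file can be created in `directory`.

    A uniquely named temporary file is created and removed again, so no
    existing file in the directory is opened, truncated or re-timestamped.
    """
    try:
        # The file is deleted automatically when the "with" block ends.
        with tempfile.NamedTemporaryFile(dir=directory, prefix=".write_test_"):
            return True
    except OSError:
        # Covers a missing directory, missing permission, read-only FS, etc.
        return False


if __name__ == "__main__":
    print(can_write_to_directory(tempfile.gettempdir()))
```

The test leaves nothing behind and never touches a name that already exists in the directory.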
  • Messy 1:

Here is another story. I was installing BO (Business Objects) via PuTTY Connection Manager. I opened many connections to the target server where I wanted to install BO. [The very first connection is the master and all the other connections are slaves; here a connection means an ssh login to the server.] I checked the prerequisites and started the installation on a slave connection. Then I started closing the other connections that were not necessary. I closed one terminal connection forcibly, and only then realized that I had closed the master connection, so all the slave connections were terminated. Obviously, my BO installation got stuck in the middle, and it took me one full day of sitting with the DBA Team to clear the schema {ticket creation, approval from my team and so on 🙁 🙁}.

Lessons Learned:

  • Open only the connections you need, and always know which one you are in.
  • When you know a job will take a long time, run the script on the server side itself, in the background, by adding ‘&’.
  • Slave connections are terminated if the master connection is closed.
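
Backgrounding a job may not be enough in every setup if the job is still tied to the terminal, so here is a rough Python sketch of going one step further: start the long-running job in its own session, with its output sent to a log file. It assumes a POSIX system. The helper name `start_detached` and the log path are hypothetical, and the demo command is only a stand-in for a real installer.

```python
import subprocess
import sys


def start_detached(command: list[str], log_path: str) -> subprocess.Popen:
    """Start `command` in its own session, logging output to `log_path`.

    start_new_session=True makes the child call setsid() (POSIX only), so
    it has no controlling terminal and is not hung up when the terminal
    that launched it is closed.
    """
    with open(log_path, "ab") as log_file:
        # The child inherits the log file descriptor, so the parent can
        # close its own copy as soon as the child has been started.
        return subprocess.Popen(
            command,
            stdin=subprocess.DEVNULL,
            stdout=log_file,
            stderr=subprocess.STDOUT,
            start_new_session=True,
        )


if __name__ == "__main__":
    # Stand-in for a long installer: a child Python process that prints.
    job = start_detached([sys.executable, "-c", "print('install finished')"],
                         "install_demo.log")
    print("started job with PID", job.pid)
```

With the output in a log file, you can log in again later and read the progress instead of keeping a terminal open all day.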
  • Messy 2:

I gave the green signal to the configuration team, saying that the necessary backup had been taken. The directory I backed up contained many subdirectories, files and symbolic links, and some of those links pointed to different locations. The configuration team set up the new configuration and then, for some reason, wanted an old configuration file, which should have been available in my zip (the backup). Well, you can guess what happened next.

Lessons Learned:

  • Don’t assume a backup of a directory includes what its symbolic links point to.
  • Always validate your backup.
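
One way to know what you are dealing with before giving the green signal is to list every symbolic link under the directory and see which ones point outside it. A small Python sketch follows, assuming a POSIX file system. The function name `find_symlinks` is made up for illustration.

```python
import os


def find_symlinks(root: str) -> list[tuple[str, str, bool]]:
    """List every symbolic link under `root`.

    Each entry is (link path, link target as stored, points_outside_root).
    """
    root_real = os.path.realpath(root)
    found = []
    # followlinks=False: linked directories are reported but not entered.
    for current_dir, dir_names, file_names in os.walk(root, followlinks=False):
        for name in dir_names + file_names:
            path = os.path.join(current_dir, name)
            if os.path.islink(path):
                resolved = os.path.realpath(path)
                outside = os.path.commonpath([root_real, resolved]) != root_real
                found.append((path, os.readlink(path), outside))
    return found


if __name__ == "__main__":
    for link, target, outside in find_symlinks("."):
        note = "OUTSIDE the tree" if outside else "inside the tree"
        print(f"{link} -> {target} ({note})")
```

Any link reported as outside the tree is a location you have to back up separately, or at least check in the archive, before anyone changes the configuration.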
  • Messy 3:

You are asked to provide the failed login attempts of the last one month for XxAIiBOx23. You log in to the server by typing the server name into PuTTY, and that is where it goes wrong, because there is a huge difference between i, I and 1.

Lesson Learned :

  • Sometimes, copy and paste is good. 🙂

 

  • Messy 4:

Deploying a public key to 4000 physical servers without a proper check.

Lesson Learned:

  • When you work in bulk, check in every possible way.

 

  • Messy 5:

I was put in charge of an educational website hosted on the server. For testing purposes we copied an exact replica of another website into ours. The site we copied had an option to accept donations on its “help us” tab. So our site got listed as a forgery website that was supposedly storing credit card details etc. [the search engine crawled our site and stored it in its database, and some random user flagged our site as a forgery]. This may look very simple and routine until you get an official warning mail from the Central Govt. and from your ISP.

Lesson Learned:

  • I should have put valid entries in the “robots.txt” file, telling search engines not to crawl that particular directory.
  • Add a valid disclaimer.
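
Before relying on a robots.txt entry, it is worth checking that the rule really covers the directory you have in mind. Here is a minimal sketch using Python's standard `urllib.robotparser`. The `/staging/` directory, the example.org URLs and the helper name `is_crawlable` are all hypothetical.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: keep all crawlers out of a /staging/ test copy.
ROBOTS_TXT = """\
User-agent: *
Disallow: /staging/
"""


def is_crawlable(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Return True if these robots.txt rules allow `user_agent` to fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for url in ("https://example.org/staging/help-us",
                "https://example.org/index.html"):
        print(url, "crawlable:", is_crawlable(ROBOTS_TXT, url))
```

Keep in mind that robots.txt is only a request to well-behaved crawlers. It does not protect the directory, so a test copy that must stay private still needs real access control.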
  • Messy 6:

I was assigned to update the application patch on the server. I logged in to the target server and the repository server (where the patches and binaries are available). I saw that the patch bundle had the same size and file name on both the target server and the repository server, so I skipped copying the patches (which would hardly have taken 2 minutes) and applied the bundle that was already on the target server. It was actually an older version, which I did not know. The patch applied successfully. Obviously, this made me very famous, but it also took four teams in different geographic locations to fix. Here “fix” means we scrapped the entire application and rebuilt it 😉 😉.

Lessons Learned:

  • When everything looks the same, like size and file name, compare the “checksum”.
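
Comparing checksums takes only a few lines. Here is a small Python sketch using SHA-256 from the standard `hashlib` module. The function names are made up for illustration, and in practice you would compute the digest on each server and compare the two hex strings.

```python
import hashlib


def sha256_of(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the SHA-256 hex digest of the file at `path`."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Read in chunks so a large patch bundle does not fill the memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def same_content(path_a: str, path_b: str) -> bool:
    """Return True only if both files have the same SHA-256 digest."""
    return sha256_of(path_a) == sha256_of(path_b)
```

Two bundles with the same name and the same size can still give different digests, and that is exactly the case that size and file name cannot catch.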

 

Which errors impact your business most:

  1. Urgency in delivering the fix, request or patch.
  2. Our attitude towards bringing our basket/pool down to ‘0’. In some cases that is not possible.
  3. Lack of interest and sleep. Frustration@work.
  4. No proper plan at the time of migration/updates.
  5. Simply following the document without thinking.
  6. Carelessness, like misreading a ticket and reacting to “Server decom” when it actually says “Service decom”.

The Best Practices:

  • Make a habit of frequently checking the present working directory and the server name before executing/installing anything.
  • Always go by the system date and time.
  • Be responsible and know the impact of every command you type. Don’t type anything with the attitude of ‘let’s see what happens with this command’, because outdoor games should be played only outside.
  • Give proper respect to Production, Pre-Prod, Development and Test servers.
  • If you are not happy with your terminal settings, change them: font color, font size etc.
  • When you come across something strange, audit it yourself.
  • Check all the SIMPLE things and be responsible for them: open port numbers, file permissions, ownership, FS check, CPU allocation, process check, iptables, IDS, ssh attempts, and user-level command history.
  • Check the sudo permissions you have given to other users, and revoke them if they are no longer necessary.
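
The first practice in the list, checking the server name and the working directory before you act, can also be built into your own scripts. Here is a rough Python sketch. The function name `on_expected_target` is hypothetical, and the expected values would come from your ticket or runbook.

```python
import os
import socket


def on_expected_target(expected_host: str, expected_dir: str) -> bool:
    """Return True only on the expected server and in the expected directory.

    The host name comparison is exact and case-sensitive, so look-alike
    names that differ only in i, I or 1 do not pass.
    """
    host_ok = socket.gethostname() == expected_host
    dir_ok = os.path.realpath(os.getcwd()) == os.path.realpath(expected_dir)
    return host_ok and dir_ok


if __name__ == "__main__":
    # Demo only: compares the current host and directory with themselves.
    print(on_expected_target(socket.gethostname(), os.getcwd()))
```

A script that refuses to continue when this check fails costs a few seconds and can save a restore.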

 

Note:

Understand the basic principle: when there is success it’s called teamwork, but when there are mistakes/errors the blame always falls on an individual. All of the above are experiences of my own and of my colleagues. Share your stories with us; it will be exciting.

“F[a-z][a-z][a-z]ing” proudly say, ” I am Admin” 🙂 🙂

Best Practices for The Linux and Application Admins