Focus: Enterprise Storage
Storage by Data

Which comes first: data or storage? A better way to answer this is to consider which data types your enterprise uses, then implement systems based on those usage patterns. by Graeme K Le Roux

Planning for a storage architecture that can cope with today's escalating data demands is always tricky.

A good starting point is to go back to basics: consider which data types your enterprise uses, then implement systems based on those usage patterns.

For most enterprises, data types fall into four categories. They are:
1. Application and operating system files
2. Temporary files
3. Raw data files
4. Data files controlled by an application.

There are also four basic types of online storage. They are simple local disks, dedicated arrays, shared arrays and network-attached storage (NAS). And there are also (you guessed it) four basic modes in which storage can operate: online, near-online, backup and archive.

Archiving formats

Archives are generally stored off-site. Note that an archived data set may overlap a near-online data set. One clear difference between the two (in fact, between archives and all other data sets) is format. This matters because an archive is useless if you cannot read it with current software.

For example, all common word processors have been able to read rich text format (RTF) for the last ten years, but their native formats have changed over that time, so older data saved in a native format may no longer be readable. Hence, if you are keeping records in word processing formats, RTF is the better choice for an archive data set. Note also that you may have to re-save documents in the designated archive format when stowing them away, and that costs time and money.

Having sorted out the archive format, you next have to pick a storage type. This is usually simple; the following rule of thumb is a common-sense guide:

1. Tape for backup
2. Tape, CD-RW or DVD-RW (depending upon your data volume) for near-online storage
3. CD/DVD-ROM for archives.
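The rule of thumb above can be sketched as a small function. The 10 GB cut-off between tape and optical media for near-online storage is an illustrative guess, not a figure from the article:

```python
def pick_medium(mode, data_volume_gb=0):
    """Map a storage mode to a typical medium, per the rule of thumb.

    data_volume_gb decides between tape and optical media for
    near-online storage; the 10 GB threshold is illustrative only.
    """
    if mode == "backup":
        return "tape"
    if mode == "near-online":
        return "tape" if data_volume_gb > 10 else "CD-RW/DVD-RW"
    if mode == "archive":
        return "CD/DVD-ROM"
    return "online disk"
```

For example, `pick_medium("near-online", data_volume_gb=50)` picks tape, since burning 50 GB to rewritable optical discs is impractical.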

The simplest way to decide what type of storage you need is to start with the modes of storage. As a rule of thumb, any data which has not been accessed for more than six months should not be in online storage, because online storage is the most expensive type.

Online storage also has to be backed up, and the more data you back up, the longer both backup and restore take. The bottom line is that online storage is expensive to maintain, so make sure you are stashing only relevant data there.
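The six-month rule can be turned into a simple audit script. This is a minimal sketch: it relies on the filesystem's last-access timestamp, which some volumes do not maintain (for instance, mounts with `noatime`), so treat its output as a hint rather than a verdict:

```python
import os
import time

SIX_MONTHS = 182 * 24 * 3600  # roughly six months, in seconds

def stale_files(root, max_age=SIX_MONTHS):
    """Yield paths under `root` not accessed for more than `max_age` seconds.

    These are candidates for demotion from online to near-online storage.
    Relies on st_atime, which not every filesystem updates.
    """
    cutoff = time.time() - max_age
    for dirpath, _dirs, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            try:
                if os.stat(path).st_atime < cutoff:
                    yield path
            except OSError:
                pass  # file vanished or is unreadable; skip it
```

Running it over a file share gives a first-cut list of data that is costing online-storage money without earning its keep.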

For data backup, there is a simple rule of thumb for system administrators: back up the state of your system, not just the data. This means backing up things like Windows registry settings, access lists and databases, not just the company's general ledger, word processing files and so on.

Ideally, enterprises should be able to restore a crashed system on completely new hardware without much scrambling around for secondary backup disks and the like.
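The point can be sketched as a trivial completeness check: a backup set that misses system state cannot restore a working system. The item names here are illustrative, not a definitive inventory:

```python
# Illustrative checklist: a restorable backup covers system state,
# not just user data. Item names are examples only.
SYSTEM_STATE = {"registry settings", "access control lists", "service config"}
USER_DATA = {"general ledger", "word processing files"}

def backup_is_restorable(backed_up):
    """True only if the backup set includes all system-state items."""
    return SYSTEM_STATE <= set(backed_up)
```

A data-only backup fails the check; only data plus system state passes, which is exactly the bare-metal-restore scenario described above.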

What constitutes near-online data? A good working definition is data which is less than two years old, has not been accessed for more than six months, but is important enough that the infrequent accesses must be quick.

By that definition, examples of near-online data include customers' account statements and correspondence in current matters. Near-online storage is generally read-only, is not backed up daily and does not usually reside on hard disk; CD-Rs are ideal for this kind of data.
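The working definitions so far combine into a simple classifier. The day counts (182 days for six months, 730 for two years) are approximations of the article's thresholds:

```python
from datetime import datetime, timedelta

def classify(created, last_accessed, now=None):
    """Classify a record per the article's working definitions:
    online if accessed within six months; near-online if under two
    years old but idle for six months; otherwise an archive candidate.
    """
    now = now or datetime.now()
    if now - last_accessed <= timedelta(days=182):
        return "online"
    if now - created < timedelta(days=730):
        return "near-online"
    return "archive"
```

So a statement created eighteen months ago and untouched for a year is near-online; the same statement at three years old becomes an archive candidate.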

The last of the four storage modes, archive, is often confused with backup. The differences are that an archive will sit on the shelf, untouched, for years, while a backup may degrade with time, and that archived data is rarely, if ever, updated. A typical backup medium is tape, while archived data is usually stored on read-only media such as CD-ROM.

The latter is also more durable: a CD-ROM left on a shelf in a store room for years will still be readable, while a tape left for the same length of time will suffer media degradation.

As a guide, you can designate as archive data anything that does not belong to your company's current financial year, is unchangeable and must be kept for a long period. This is typically financial information that companies are legally required to retain for some years.


For information on SANs and high speed networking, check out the following websites:

  • Fibre Channel Industry Association (US and European sites)
  • The 10 Gigabit Ethernet Alliance
  • The Storage Networking Industry Association
  • The SCSI Trade Association

Storage strategy
Besides its period of relevance and usage pattern, data should also be categorised by application type.

Application and operating system files are almost always online and, often necessarily, stored on a simple hard disk local to the platform that runs them. For application files, that platform may be the PC or, in a thin-client architecture, the application server.

Operating systems are typically stored locally, although some environments let you share an OS disk between several hosts. From personal experience, this form of OS sharing is highly risky and often very messy in the event of a crash.

And don't forget temporary files. These usually sit alongside the operating system and applications on a local hard drive, except when the files are huge. Print files, and data exported from one application for import into another during a batch process, can be real storage hogs; in such cases it may be worth using an external hard drive for temp files.

Or if you deal with, say, export/import of live stock feeds from a mainframe-based DBMS to a server-based SQL engine, you may even want to consider NAS. In that case, attach the NAS devices to a dumb switch that is also connected directly, and exclusively, to the relevant hosts via dedicated NICs.

NAS is also an excellent choice for raw data files such as word processing documents and spreadsheets. Whether this information is accessed directly from a user's desktop PC or indirectly via a thin client and application server, NAS devices offer the best mix of cost-effectiveness and responsiveness, largely because most NAS devices dispense with a full-blown OS. Again, high-speed, dumb, Layer 2 switches are the way to go.

Sharing from front

It is one thing to construct a large data set; it is quite another to give a large number of concurrent users real-time access to it. The reason is simple: any host has finite throughput. A SAN gets around this limitation by letting multiple hosts share the same data set concurrently. Unfortunately, this simply shifts the problem: we now have as much processing throughput as we want, but we still have to connect that processing power to our users.

The key here is load-balancing switches. A load-balancing switch presents a single IP address and port set (in this case, the ports needed to service Web requests and responses). As far as the routers, and therefore any user beyond them, are concerned, there is one Web server at one Web address; the switch transparently spreads page requests across the actual Web servers. In turn, each Web server sees a single DBMS server at a single IP address, courtesy of a second load-balancing switch. In this example, the network links are Ethernet.
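The behaviour described above (one virtual address fronting several real servers) can be sketched as a round-robin dispatcher. Real load-balancing switches also track server health and session affinity, which this toy model omits:

```python
import itertools

class RoundRobinBalancer:
    """Toy model of a load-balancing switch: clients address one
    virtual server; requests are spread across the real servers."""

    def __init__(self, servers):
        self._cycle = itertools.cycle(servers)

    def route(self, request):
        """Pick the next real server and hand it the request."""
        server = next(self._cycle)
        return server, request

# Hypothetical tier names, for illustration only.
web_tier = RoundRobinBalancer(["web1", "web2", "web3"])
```

Chaining two such balancers, one in front of the Web tier and one between the Web and DBMS tiers, reproduces the two-stage arrangement the article describes.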

All the Web servers and all the DBMS servers share a common file set on their respective disk farms. Note that the backup server attached to the DBMS disk farm is invisible to the rest of the system. Disk farms and servers are interconnected via Fibre Channel links.

While the load-balancing switches are shown as single boxes, in reality each is likely to be a switch farm with failover capability, to prevent it from becoming a single point of failure.

Similarly one would expect that the routers would be connected to independent links to the Internet—say a mix of terrestrial, satellite and microwave services.

Beyond NAS lies the SAN, which bypasses the network access arbitration and latency issues inherent in NAS by providing a high-speed storage path that is independent of the local network.

In simple terms, a SAN lets its networked hosts access their relevant file sets as if they sat on a local hard disk. SANs work at the level of the disk controller, typically through a Fibre Channel or SCSI adapter. This approach lets you attach thousands of hosts and disks such that all the hosts see a single local disk set.

You also have to use a SAN when dealing with data files controlled by an application, especially when multiple instances of the application must work on the one live file set concurrently.

Disk arrays in a SAN fabric can be shadowed and the individual shadow sets switched on- or off-line at will. Disk sets can also be backed up by dedicated hosts at any time, irrespective of the instantaneous system load. And with Fibre Channel, you can put the disks at the end of a 10 km fibre cable in a safe, fire/flood/earthquake/bomb-proof vault.

If your data centre is destroyed, all you need to do is connect new fiber cables to new hosts and you are back in business.

Graeme K. Le Roux is the director of Morsedawn (Australia), a company that specialises in network design and consultancy; he writes for Network Computing - Asian Edition.


Copyright 2001: Indian Express Group (Mumbai, India). All rights reserved throughout the world. This entire site is compiled in Mumbai by The Business Publications Division of the Indian Express Group of Newspapers. Site managed by BPD