Which comes first: data or storage? A better way to understand
this is by considering what data types your enterprise
is using and then implementing systems based on usage patterns.
by Graeme K Le Roux
Settling on a storage architecture that can cope with today's
escalating data demands is always tricky.
A good starting point is to go back to basics: consider
what data types your enterprise is using, then implement
systems based on that usage pattern.
For most enterprises, data types fall into four categories:
1. Application and operating system files
2. Temporary files
3. Raw data files
4. Data files controlled by an application.
There are also four basic types of online storage. They
are simple local disks, dedicated arrays, shared arrays
and network-attached storage (NAS). And there are also
(you guessed it) four basic modes in which storage can
operate: online, near online, backup and archive.
Archived data sets are generally stored off-site. Note that an archived
data set may overlap a near-online data set. However,
one clear difference between the two (actually,
between archive and all other data types) is
the format issue. This is important because an
archive is useless if you can't read it with current software.
For example, all common word processors can read
rich text format and have been able to for the
last ten years, but all those word processing
packages have changed their native formats in
that time and thus older data in native format
may not be currently readable. Hence, if you are
keeping records in word processing formats, rich
text format would be a better choice for an archive
data set. Note also that you may have to re-save
your documents in a designated archive format
when stowing them away for archival, and that
may cost time and money.
Having sorted out the archive format, you next
have to pick a storage type. This is usually simple,
and the following rule of thumb is common sense:
1. Tape for backup
2. Tape, CD-RW or DVD-RW (depending upon your
data volume) for near-online storage
3. CD/DVD-ROM for archives.
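This rule of thumb can be captured as a simple lookup. The sketch below is illustrative only; the media names and the fallback to hard disk for online data are assumptions drawn from the text:

```python
# Hypothetical lookup of the rule-of-thumb medium for each storage mode.
MEDIA_BY_MODE = {
    "backup": ["tape"],
    "near-online": ["tape", "CD-RW", "DVD-RW"],  # choose by data volume
    "archive": ["CD-ROM", "DVD-ROM"],
}

def suggest_media(mode):
    """Return the rule-of-thumb media for a given storage mode."""
    # Online data stays on (expensive) hard disk by default.
    return MEDIA_BY_MODE.get(mode.lower(), ["hard disk"])

print(suggest_media("archive"))  # ['CD-ROM', 'DVD-ROM']
```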
The simplest way to decide what type of storage
you need is to start with the modes of storage. As a
rule of thumb, any data which has not been accessed
for more than six months should not be in online storage,
because online storage is the most expensive type.
Online storage also has to be backed up and the more
data you have to back up, the longer the backup takes
to do or to restore. The bottom line is that online
storage is expensive to maintain, so make sure you are
stashing only relevant data there.
When it comes to data backup, there is a simple rule of thumb for system
administrators: back up the state of your system,
not just the data. This means that you need to back up
things like Windows registry settings, access lists
and databases, etc., not just the company's general
ledger, word processing files, and so on.
Ideally, enterprises should be able to restore a crashed
system on completely new hardware without much scrambling
around for secondary backup disks and the like.
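As an illustration, the "state, not just data" rule can be expressed as a backup manifest. The entries below are examples only, not a complete list for any particular platform:

```python
# Illustrative backup manifest: system state first, then business data.
# Item names are examples; adapt them to your own environment.
SYSTEM_STATE = [
    "Windows registry settings",
    "access lists",
    "user and group databases",
]
BUSINESS_DATA = [
    "general ledger",
    "word processing files",
]

def backup_manifest():
    """A restore onto new hardware needs system state as well as data."""
    return SYSTEM_STATE + BUSINESS_DATA

print(len(backup_manifest()))  # 5
```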
What constitutes near-online data? A good working definition
is data which is less than two years old, has not been
accessed for more than six months, but is important
enough that the infrequent accesses must be quick.
By that definition, near-online data examples include
customers' account statements and correspondence in
current matters. Near-online storage is generally read-only
and not backed up daily. It does not usually reside
on hard disks. For this kind of data, CD-Rs are ideal.
The last of the four storage modes, archive, is often
confused with backup. The differences between the two
are that an archive store will last on the shelf, untouched,
for years while a backup may degrade with time and archived
data is not updated frequently. A typical medium for
a backup is tape while archived data is frequently stored
on read-only media like CD-ROM.
The latter is also more durable, since you can leave the
CD-ROM on a shelf in a store room for years and it will
still be readable while a tape left for the same length
of time will suffer from media degradation.
As a guide, you can designate archive data as data which
does not belong to your company's current financial
year, is unchangeable and is to be kept for a long period
of time. These are likely to be company financial information
that companies are legally required to keep for some
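Taken together, the guidelines above amount to a simple decision rule. The sketch below encodes them using the thresholds from the text (six months since last access, two years of age, outside the current financial year); the cut-offs are rules of thumb, not fixed standards:

```python
from datetime import date, timedelta

def storage_mode(created, last_accessed, in_current_fy, immutable, today=None):
    """Classify a data set into a storage mode using the rules of thumb above."""
    today = today or date.today()
    six_months = timedelta(days=183)
    two_years = timedelta(days=730)

    # Archive: outside the current financial year, unchangeable, kept long-term.
    if not in_current_fy and immutable:
        return "archive"
    # Near-online: under two years old but untouched for six months or more.
    if today - created < two_years and today - last_accessed >= six_months:
        return "near-online"
    # Online: anything accessed within the last six months.
    if today - last_accessed < six_months:
        return "online"
    # Otherwise it no longer earns its keep on expensive online storage.
    return "near-online"

today = date(2003, 1, 1)
print(storage_mode(date(2002, 1, 1), date(2002, 3, 1), True, False, today))  # near-online
```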
For more information on SANs and high-speed networking,
check out the following websites:
Fibre Channel Industry Association at www.fibrechannel.com
(in the US) and www.fibrechannel-europe.com (in Europe)
The 10 Gigabit Ethernet Alliance at www.10gea.org
The Storage Networking Industry Association
The SCSI Trade Association at www.scsita.org
Besides their length of relevancy and usage patterns,
data should also be categorized based on the applications
that use them. Application and operating system files are
almost de facto online, and often necessarily, stored on
a simple local hard disk; that is, local to the platform which will run
them. For application files, this may mean the PC, or
the application server (for thin-client architectures). Operating
systems are typically stored locally, although some
environments allow you to share an OS disk between several
hosts. But from personal experience, this form of OS-sharing
is highly risky and often very messy in the event of a failure.
And don't forget temporary files. These are typically
associated with the operating system and applications
and stored on a local hard drive, except when dealing with huge
temporary files. For example, print files and data exported
from one application for import into another during
a batch process can be a real storage hog. In this case,
it may be worthwhile considering using an external hard
drive to store temp files.
Or if you deal with, say, export/import operations of
live stock feeds from a mainframe-based DBMS to a server-based
SQL engine, you may even want to consider using NAS.
In this case, consider attaching NAS devices to a dumb
switch which is also connected directly, and exclusively,
to the relevant hosts via a dedicated NIC.
NAS is also an excellent choice for storing raw data
files, such as word processing documents, spreadsheets,
etc. Whether this information is accessed directly from
a user's desktop via a standard PC or indirectly via
a thin client and application server, NAS devices offer
the best mix of cost-effectiveness and responsiveness, due
to the lack of a full-blown OS in most NAS devices.
Again, using high speed, dumb, Layer 2 switches is the
way to go.
It is one thing to construct a large data set; it
is quite another to provide real-time access to
it to a large number of concurrent users. The
reason for this is simple; any host has a finite
throughput. We get around this limitation in a
SAN by arranging things so that multiple hosts
can concurrently share the same data set. Unfortunately,
this simply shifts the problem; we now have as
much processing throughput as we want but we have
to provide access to this processing power to a large number of users.
The key here is the use of load balancing switches.
A load balancing switch works by presenting a
single IP address and port set (in this case,
those ports that are necessary to service Web
requests and responses). As far as the routers (and
therefore any user beyond them) are concerned,
there is one Web server at one Web address. The
switch transparently shares requests for Web pages
between the actual Web servers. In turn, each
of the Web servers sees a single DBMS server at
a single IP address courtesy of a second load
balancing switch. In this example, all the Web servers
and all the DBMS servers share a common file set
on their respective disk farms.
Note that the backup server attached to the DBMS
disk farm is invisible to the rest of the system.
Disk farms and servers are interconnected via
Fibre Channel links.
While the load balancing switches are shown as
single boxes, in reality they are likely to be
a switch farm with failover capability to prevent
them becoming a single point of failure.
Similarly, one would expect that the routers would
be connected to independent links to the Internet, say
a mix of terrestrial, satellite and microwave links.
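The single-address, many-servers idea can be illustrated with a toy round-robin dispatcher. Real load-balancing switches do this in hardware at wire speed; this is only a sketch of the behaviour, and the addresses and server names are made up:

```python
import itertools

class RoundRobinBalancer:
    """Toy model of a load-balancing switch: one virtual address,
    requests shared transparently across a pool of real servers."""

    def __init__(self, virtual_ip, servers):
        self.virtual_ip = virtual_ip           # what routers (and users) see
        self._pool = itertools.cycle(servers)  # the real Web servers behind it

    def dispatch(self, request):
        """Hand the next request to the next server in rotation."""
        return next(self._pool)

lb = RoundRobinBalancer("203.0.113.10", ["web1", "web2", "web3"])
print([lb.dispatch(f"GET /page{i}") for i in range(4)])
# ['web1', 'web2', 'web3', 'web1']
```

A second balancer in front of the DBMS servers works the same way, which is how each Web server can see "one" database at one address.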
Beyond NAS lies the SAN, which bypasses the network access
arbitration and latency issues inherent in NAS by providing
high-speed storage access that is independent of the general-purpose network.
In simple terms, a SAN lets its networked hosts access
their relevant file sets as if they were sitting on
a local hard disk. SANs work at the level of a disk
controller, typically through a Fibre Channel or SCSI
adapter. This approach allows you to attach thousands
of hosts and disks such that all the hosts see a single
local disk set.
You also have to use a SAN when you are dealing with
data files controlled by an application, especially
when you need to have multiple instances of an application
dealing with the one live file set concurrently.
Disk arrays in a SAN fabric can be shadowed and the
individual shadow sets switched on- or off-line at will.
Disk sets can also be backed up by dedicated hosts at
any time, irrespective of the instantaneous system load.
And with Fibre Channel, you can put the disks at the
end of a 10 km fiber cable in a safe, fire/flood/earthquake/bomb-proof
location. If your data centre is destroyed, all you need to do
is connect new fiber cables to new hosts and you are
back in business.
Graeme K. Le Roux is the director of Morsedawn (Australia),
a company which specialises in network design and consultancy,
and writes for Network Computing - Asian Edition.