Oracle Application High Availability

This article discusses topics related to the high availability of Oracle applications.

Many enterprises use their information technology infrastructure to provide a competitive advantage, increase productivity, and empower users to make faster and more informed decisions. However, with these benefits comes an increasing dependence on that infrastructure. If a critical application becomes unavailable, then the entire business can be in jeopardy. The business can lose customers and revenue, incur penalties, or suffer bad publicity that adversely affects its stock price and customer base. Therefore it is critical to examine how your data is protected and to maximize its availability to your users.

What is High Availability

Availability is the degree to which an application, service, or feature is available upon user demand. Availability is measured by the perception of the application user. Application users experience frustration when their data is unavailable, and they do not understand or care to differentiate between the complex components of an overall solution.

Importance of Availability

The importance of high availability varies among applications. However, the need to deliver increasing levels of availability continues to accelerate as enterprises reengineer their solutions to gain competitive advantage. Most often, these new solutions rely on immediate access to critical business data. When data is not available, the operation can cease to function. Downtime can lead to lost productivity, lost revenue, damaged customer relationships, bad publicity, and lawsuits. Revenue losses and legal penalties are incurred when service level agreement objectives are not met.

Other factors to consider in the cost of downtime are the maximum tolerable duration of a single unplanned outage and the maximum allowable frequency of incidents. If an outage lasts less than 30 seconds, it may have very little effect and may be barely perceptible to users. As the duration of the outage grows, the effect may grow exponentially and negatively affect the business. It is important to take these issues into account when designing a solution. An organization should weigh the true cost of downtime and balance it against the expected availability improvement.
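To make this trade-off concrete, the sketch below converts an availability target into a yearly downtime budget and a rough cost. The function names and the flat hourly cost are illustrative assumptions. As noted above, the real cost of an outage may grow faster than linearly with its duration.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, ignoring leap years


def allowed_downtime_hours(availability_percent):
    """Return the yearly downtime budget implied by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    return HOURS_PER_YEAR * (1 - availability_percent / 100)


def yearly_downtime_cost(availability_percent, cost_per_hour):
    """Estimate yearly downtime cost, assuming a flat (hypothetical) hourly cost."""
    return allowed_downtime_hours(availability_percent) * cost_per_hour


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99):
        hours = allowed_downtime_hours(target)
        print(f"{target}% availability -> {hours:.2f} hours of downtime per year")
```

Comparing the cost figure for two availability targets gives a first estimate of how much an availability improvement is worth.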

Causes of Downtime

One of the challenges in designing a high availability solution is examining and addressing all possible causes of downtime. It is important to consider causes of both unplanned and planned downtime when designing a fault-tolerant and resilient IT infrastructure. Planned downtime can be just as disruptive to operations as unplanned downtime, especially in global enterprises that support users in multiple time zones.

 

Unplanned downtime is primarily the result of computer failures or data failures. Planned downtime is primarily due to data changes or system changes that must be applied to the production system. In the following sections we will in turn look at each of these four causes of downtime and examine the technology you can apply to avoid them.

PROTECTING AGAINST COMPUTER FAILURES
A computer failure occurs when the computer system or database server unexpectedly fails and causes a service interruption. In most cases this is due to hardware breakdown. These types of failures are best remedied by taking advantage of fast database crash recovery and cluster technology.

Real Application Clusters (RAC) enables the enterprise to build database servers across multiple systems that are highly available and highly scalable. In a Real Application Clusters environment Oracle Database runs on two or more systems in a cluster while concurrently accessing a single shared database.

 

 

BOUNDING DATABASE CRASH RECOVERY
One of the most common causes of unplanned downtime is a system fault or crash. System faults are the result of hardware failures, power failures, and operating system or server crashes. The amount of disruption these failures cause depends on the number of affected users and on how quickly service is restored. High availability systems are designed to recover from failures quickly and automatically. Users of critical systems look to the IT organization for a commitment that recovery from a failure will be fast and will take a predictable amount of time. Periods of downtime longer than this commitment can have direct effects on operations and lead to lost revenue and productivity. The Oracle Database provides very fast recovery from system faults and crashes. However, being predictable is as important as being fast. The Fast-Start Fault Recovery technology included in the Oracle Database automatically bounds database crash recovery time and is unique to the Oracle Database. The database self-tunes checkpoint processing to safeguard the desired recovery time objective. This makes recovery time fast and predictable, and improves the ability to meet service level objectives. Oracle's Fast-Start Fault Recovery can reduce recovery time on a heavily loaded database from tens of minutes to less than 10 seconds.
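In Oracle Database, the recovery time objective is set through the `FAST_START_MTTR_TARGET` initialization parameter (in seconds), and the `V$INSTANCE_RECOVERY` view reports the current target and estimate. The sketch below only renders the SQL text a DBA might run. The helper name and the 30-second value in the usage are illustrative assumptions, and no database connection is made.

```python
MAX_MTTR_SECONDS = 3600  # upper bound documented for FAST_START_MTTR_TARGET

# Query to compare the configured target with the database's current estimate.
MONITOR_QUERY = "SELECT target_mttr, estimated_mttr FROM v$instance_recovery"


def mttr_target_statement(seconds):
    """Render the ALTER SYSTEM statement that sets the recovery time target."""
    if isinstance(seconds, bool) or not isinstance(seconds, int):
        raise ValueError("seconds must be an integer")
    if not 0 <= seconds <= MAX_MTTR_SECONDS:
        raise ValueError(f"seconds must be between 0 and {MAX_MTTR_SECONDS}")
    return f"ALTER SYSTEM SET FAST_START_MTTR_TARGET = {seconds} SCOPE=BOTH"


if __name__ == "__main__":
    print(mttr_target_statement(30))
    print(MONITOR_QUERY)
```

A lower target generally means more frequent checkpoint writes, so the value is a trade-off between recovery time and runtime I/O.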

Real Application Clusters gives users the flexibility to add nodes to the cluster as the demand for capacity increases, scaling the system up incrementally to save costs and eliminating the need to replace smaller single-node systems with larger ones. Grid pools of standard low-cost computers and modular disk arrays make this solution even more powerful with the Oracle Database. The capacity upgrade process becomes much easier and faster, because one or more nodes can be added to the cluster instead of replacing existing systems with new and larger nodes. The Cache Fusion technology implemented in Real Application Clusters and the InfiniBand support provided in the Oracle Database 10g enable capacity to be scaled near linearly without making any changes to your application.

 PROTECTING AGAINST DATA FAILURES:

Data failures are more complex and subtle than computer failures. They can be caused by a failure of the storage hardware, human error, corruption, or site failure.

 

 It is extremely important to design a solution to protect against and recover from data failures. A system or network fault may prevent users from accessing data, but data failures without proper backups or recovery technology can result in a recovery taking many hours to perform, or lost data.

 PROTECTING AGAINST STORAGE FAILURES

Storage failures can be addressed with the Automatic Storage Management (ASM) feature of the Oracle Database. ASM provides a vertically integrated file system and volume manager directly in the Oracle kernel. This results in much less work to provision database storage and a higher level of availability, without the expense, installation, and maintenance of specialized storage products, and it provides unique capabilities for database applications. ASM spreads its files across all available storage for optimal performance, and it can also mirror them, providing protection against data loss.

ASM extends the concept of SAME (stripe and mirror everything) and adds flexibility: it can mirror at the database file level instead of having to mirror entire disks. More importantly, ASM eliminates the complexity associated with managing data and disks; it vastly simplifies the processes of setting up mirroring, adding disks, and removing disks. Rather than managing hundreds, possibly thousands, of files (as in a large data warehouse), DBAs using ASM create and administer a larger-grained object, the disk group, which identifies the set of disks that will be managed as a logical unit. The automation of file naming and placement of the underlying database files saves DBAs time and ensures that best practice standards are followed. ASM's native mirroring mechanism is an option that is used to protect against storage failures.
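The idea of striping and mirroring can be illustrated with a toy model. The sketch below spreads extents round-robin across all disks and places each mirror copy on a disk in a different failure group. It is a simplified illustration under assumed names (`place_extents`, `survives_disk_loss`, the disk and group labels) and does not reproduce ASM's actual allocation algorithm.

```python
def place_extents(num_extents, failure_groups):
    """Assign each extent a primary disk and a mirror disk in another failure group.

    failure_groups maps a group name to a list of unique disk names.
    Returns a list of (primary_disk, mirror_disk) tuples, one per extent.
    """
    if len(failure_groups) < 2:
        raise ValueError("mirroring needs at least two failure groups")
    if any(not members for members in failure_groups.values()):
        raise ValueError("every failure group needs at least one disk")

    disks = [(group, disk)
             for group, members in failure_groups.items()
             for disk in members]
    placements = []
    for extent in range(num_extents):
        # Striping: primaries rotate over every disk in the disk group.
        primary_group, primary_disk = disks[extent % len(disks)]
        # Mirroring: the second copy goes to a disk outside the primary's group.
        partners = [disk for group, disk in disks if group != primary_group]
        placements.append((primary_disk, partners[extent % len(partners)]))
    return placements


def survives_disk_loss(placements, failed_disk):
    """True if every extent still has a copy on a disk other than failed_disk."""
    return all(primary != failed_disk or mirror != failed_disk
               for primary, mirror in placements)


if __name__ == "__main__":
    groups = {"fg1": ["disk1", "disk2"], "fg2": ["disk3", "disk4"]}
    layout = place_extents(8, groups)
    print(layout)
    print(all(survives_disk_loss(layout, d) for g in groups.values() for d in g))
```

Because the two copies of an extent never share a failure group, losing any single disk in this model leaves one readable copy of every extent.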

One1Up has developed programs to solve the issue of storage failures using the snapshot, snap mirror and snap clone capabilities. With snap clone or snap mirror an incremental (only changes are recorded) backup is taken every 5 minutes to a different location.

PROTECTING AGAINST HUMAN ERRORS

Almost any research done on the causes of downtime identifies human error as the single largest cause. Human errors, such as the inadvertent deletion of important data or an incorrect WHERE clause in an UPDATE statement that updates many more rows than intended, need to be prevented wherever possible and undone when the precautions against them fail. The Oracle Database provides easy-to-use yet powerful tools that help administrators quickly diagnose and recover from these errors, should they occur. It also includes features that allow end users to recover from problems without administrator involvement, reducing the support burden on the DBA and speeding recovery of lost and damaged data.

GUARDING AGAINST HUMAN ERRORS
The best way to prevent errors is to restrict users' access to only the data and services they truly need to conduct their business. The Oracle Database provides a wide range of security tools to control user access to application data by authenticating users and then allowing administrators to grant users only those privileges required to perform their duties. In addition, the security model of the Oracle Database provides the ability to restrict data access at the row level, using the Virtual Private Database feature, further isolating users from data they are not allowed to see.
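Row-level restriction can be pictured as a policy that is applied to every query on behalf of the user. In Oracle, a Virtual Private Database policy function returns a predicate that the database adds to the statement. The toy model below mimics that idea in memory, and the table contents, user names, and function names are all hypothetical.

```python
ORDERS = [
    {"order_id": 1, "region": "EMEA", "amount": 120},
    {"order_id": 2, "region": "APAC", "amount": 75},
    {"order_id": 3, "region": "EMEA", "amount": 310},
]

# Hypothetical mapping of users to the only region they may see.
USER_REGION = {"alice": "EMEA", "bob": "APAC"}


def region_policy(user):
    """Return a row predicate for the user; unknown users match no rows."""
    region = USER_REGION.get(user)
    return lambda row: region is not None and row["region"] == region


def query_orders(user, rows=ORDERS):
    """Return only the rows the user's policy allows, whatever the caller asks for."""
    allowed = region_policy(user)
    return [row for row in rows if allowed(row)]


if __name__ == "__main__":
    print([row["order_id"] for row in query_orders("alice")])
    print([row["order_id"] for row in query_orders("mallory")])
```

The point of the design is that the filter lives with the data rather than in each application, so a mistaken or malicious query cannot reach rows outside the user's scope.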

 ORACLE FLASHBACK TECHNOLOGY:
When authorized people make mistakes, you need the tools to correct these errors. The Oracle Database 10g provides a family of human error correction technologies called Flashback. Flashback revolutionizes data recovery. In the past it might take minutes to damage a database but hours to recover it. With Flashback, the time to correct an error is about the same as the time it took to make it. It is also extremely easy to use: a single short command can recover the entire database, instead of a complex procedure. Flashback provides a SQL interface to quickly analyze and repair human errors. It provides fine-grained, surgical analysis and repair for localized damage, such as when the wrong customer order is deleted. It also allows correction of more widespread damage, such as when all of this month's customer orders have been deleted, and does so quickly to avoid long downtime. Flashback is unique to the Oracle Database and supports recovery at all levels, including the row, transaction, table, tablespace, and entire database.

  • FLASHBACK QUERY. Oracle Flashback Query, a feature introduced in the Oracle9i Database, allows an administrator or user to query any data as of some point in time in the past. This powerful feature can be used to view and reconstruct lost data that may have been deleted or changed by accident. Developers can use this feature to build self-service error correction into their applications, empowering end users to undo and correct their errors without delay, rather than burdening administrators with this task. Flashback Query is extremely simple to manage, as the database automatically keeps the necessary information to reconstruct data for a configurable time into the past.
  • FLASHBACK VERSIONS QUERY. The Flashback Versions Query provides a way to view changes made to the database at the row level. It is an extension to SQL and allows the retrieval of all the different versions of a row across a specified time interval. Using this query, a DBA can pinpoint when and how data is changed and trace it back to the user, application, or transaction. This allows the DBA to track down the source of a logical corruption in the database and correct it. It also enables the application developer to debug their code.
  • FLASHBACK TRANSACTION QUERY. Flashback Transaction Query provides a way to view changes made to the database at the transaction level. It is an extension to SQL that allows you to see all changes made by a transaction. In addition, compensating SQL statements are returned and can be used to undo the changes made to all rows by this transaction. With a precision tool like this, the DBA and application developer can precisely diagnose and correct logical problems in the database or application.
  • FLASHBACK DATABASE. The traditional method of bringing an Oracle database to a previous point in time is point-in-time recovery. However, point-in-time recovery can take hours, or even days, since it requires the whole database to be restored from backup and recovered to the point in time just before the error was introduced, and restore times keep growing as databases grow. Flashback Database is a new strategy for point-in-time recovery. It quickly rewinds an Oracle database to a previous time to correct any problems caused by logical data corruption or user error.
  • FLASHBACK TABLE. Flashback Table provides a way to bring a table to a previous point in time. Flashback Table performs this operation online and in-place and it maintains any referential integrity constraints between the tables. Flashback Table is just like having a rewind or undo button for a table, or set of related tables.
  • FLASHBACK DROP. Dropping, or deleting, database objects by accident is a mistake people have made, and probably always will make. Flashback Drop provides a safety net when dropping objects in Oracle Database 10g. When a user drops a table, Oracle places it in a Recycle Bin. Objects in the Recycle Bin remain there until the user decides to permanently remove them or until space pressure is placed on the tablespace containing the table. Flashback Drop is just like having an un-drop button for a table and its dependent objects.
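The Flashback features listed above are all driven by short SQL statements. As a hedged sketch, the helper below renders one representative statement per feature for a given table. The helper name, the `orders` table, and the 15-minute window in the usage are illustrative assumptions, and nothing is executed against a database. In practice, Flashback Table typically also requires row movement to be enabled on the table, and Flashback Database is run by an administrator with the database mounted.

```python
import re

# Conservative check for a simple, unquoted Oracle identifier.
_IDENTIFIER = re.compile(r"[A-Za-z][A-Za-z0-9_$#]*")


def flashback_statements(table, minutes_ago):
    """Render example Flashback SQL for one table and a point minutes_ago in the past."""
    if not _IDENTIFIER.fullmatch(table):
        raise ValueError("table must be a simple unquoted identifier")
    # INTERVAL '<n>' MINUTE accepts two digits by default, so stay within 1..99.
    if isinstance(minutes_ago, bool) or not isinstance(minutes_ago, int):
        raise ValueError("minutes_ago must be an integer")
    if not 1 <= minutes_ago <= 99:
        raise ValueError("minutes_ago must be between 1 and 99")

    past = f"(SYSTIMESTAMP - INTERVAL '{minutes_ago}' MINUTE)"
    return {
        "flashback_query":
            f"SELECT * FROM {table} AS OF TIMESTAMP {past}",
        "flashback_versions_query":
            "SELECT versions_xid, versions_starttime, versions_endtime, "
            f"versions_operation FROM {table} "
            f"VERSIONS BETWEEN TIMESTAMP {past} AND SYSTIMESTAMP",
        "flashback_transaction_query":
            "SELECT xid, operation, undo_sql FROM flashback_transaction_query "
            f"WHERE table_name = '{table.upper()}'",
        "flashback_table":
            f"FLASHBACK TABLE {table} TO TIMESTAMP {past}",
        "flashback_drop":
            f"FLASHBACK TABLE {table} TO BEFORE DROP",
        "flashback_database":
            f"FLASHBACK DATABASE TO TIMESTAMP (SYSDATE - {minutes_ago}/1440)",
    }


if __name__ == "__main__":
    for name, sql in flashback_statements("orders", 15).items():
        print(f"{name}: {sql}")
```

The identifier check is deliberately strict, since the table name is interpolated into SQL text rather than bound as a parameter.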

PROTECTING AGAINST DATA CORRUPTIONS:
A corruption is created by a faulty component in the IO stack. For example, the database issues IOs as the result of an update transaction. These IOs pass through the IO code in the operating system, the file system, the volume manager, the device driver, the host bus adapter, and the storage controller before the disk drive finally writes them. A bug or hardware failure in any component of the IO stack could "flip some bits" in the data, resulting in corrupt data being written to the database. The corruption could affect database control information or user data, and either case could be catastrophic to the functioning and availability of the database. Similarly, a disk failure could damage database files, requiring backups to be used to recover the database.
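A common defense against flipped bits is to store a checksum with each block when it is written and verify it when the block is read. Oracle exposes block checksumming through the `DB_BLOCK_CHECKSUM` parameter. The sketch below illustrates only the general idea, using CRC-32 as a stand-in; the block size, function names, and checksum algorithm are assumptions and not Oracle's implementation.

```python
import zlib

BLOCK_SIZE = 8192  # assumed block size for the illustration


def write_block(payload):
    """Return the payload together with the checksum computed at write time."""
    return payload, zlib.crc32(payload)


def is_intact(payload, stored_checksum):
    """Recompute the checksum at read time and compare it with the stored one."""
    return zlib.crc32(payload) == stored_checksum


def flip_bit(payload, bit_index):
    """Simulate a faulty IO component by flipping a single bit."""
    corrupted = bytearray(payload)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)


if __name__ == "__main__":
    block, checksum = write_block(b"customer order 42".ljust(BLOCK_SIZE, b"\x00"))
    print(is_intact(block, checksum))
    print(is_intact(flip_bit(block, 100), checksum))
```

Detection alone does not repair the block, so a checksum is usually combined with a mirror copy or a backup from which the damaged block can be restored.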

PROTECTING AGAINST SITE FAILURES
Data protection features provide protection from catastrophic events that cripple processing at a site for an extended period of time. Examples include file corruptions, natural disasters, power and communication outages, and even terrorism. The Oracle Database offers a variety of data protection solutions that provide the ability to create and maintain a local or remote copy of a production database. In the event of a corruption or disaster, users of the data can continue to function by accessing the remote database. The simplest form of data protection is off-site storage of database backups. If a data center is unable to resume services in a reasonable amount of time, the backups can be restored on a system at another site, and users can connect to the backup system. Unfortunately, restoring backups on another system is time consuming, and the backup may not be completely up to date. To recover more quickly and maintain continuous database service even in the event of a disaster, Oracle provides Standby Database.

A Standby Database is an exact copy of an operational database on a remote server, ready to be used for backup, disaster recovery, analysis, and reporting, to name a few applications. This copy is kept up to date by applying the changes from the primary database to the standby database.

For disaster recovery, if the primary database fails (due to hardware or software failure), users can be redirected to the standby database with very little downtime: less than 20 minutes for a well-configured system.

By replicating your primary database to a standby database, you free the primary database from the heavy load involved in running large reports.
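The mechanism of keeping a standby current by applying the primary's changes can be sketched as a toy model. Here the primary records every change in a redo log, and the standby replays the records it has not yet seen. The class and function names are hypothetical, and the model ignores transport, ordering guarantees, and failover handling that a real standby configuration provides.

```python
class Database:
    """A tiny key-value store that changes only by applying redo records."""

    def __init__(self):
        self.rows = {}
        self.applied = 0  # number of redo records applied so far

    def apply(self, record):
        operation, key, value = record
        if operation == "put":
            self.rows[key] = value
        elif operation == "delete":
            self.rows.pop(key, None)
        else:
            raise ValueError(f"unknown operation: {operation}")
        self.applied += 1


class Primary(Database):
    """The production database: applies each change and logs it as redo."""

    def __init__(self):
        super().__init__()
        self.redo_log = []

    def execute(self, operation, key, value=None):
        record = (operation, key, value)
        self.apply(record)
        self.redo_log.append(record)


def sync_standby(primary, standby):
    """Ship and apply the redo records the standby has not yet seen."""
    pending = primary.redo_log[standby.applied:]
    for record in pending:
        standby.apply(record)
    return len(pending)


if __name__ == "__main__":
    primary, standby = Primary(), Database()
    primary.execute("put", "order-1", "pending")
    primary.execute("put", "order-1", "shipped")
    print(sync_standby(primary, standby), standby.rows == primary.rows)
```

The gap between the primary's log and what the standby has applied is what determines how much data could be lost and how long a switchover takes in this model.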

 

All of the solutions described above address database failures. Application failures require a separate solution: application availability depends on system availability, which can be provided by an operating system cluster, VERITAS Cluster, or VMware.

 

The Oracle Application includes important and revolutionary availability features to ensure your data and database are available whenever and wherever they are needed.

One1up implements each of these solutions with the best specialists in Israel.

 CONCLUSION
As the key component of your IT infrastructure, the Oracle Database provides the features and tools to ensure data access and availability for mission-critical applications. The Oracle Business includes important and revolutionary availability features to ensure your data and database are available whenever and wherever they are needed.

 Author:

איימן נטשה, DBA, technology expert

Contact: roman.mitshel@one1up.com

 

 
