Why do servers fail?
Selected subsystems within each WebLogic Server instance monitor their health status based on criteria specific to the subsystem. For example, the JMS subsystem monitors the condition of the JMS thread pool while the core server subsystem monitors default and user-defined execute queue statistics.

If an individual subsystem determines that it can no longer operate in a consistent and reliable manner, it registers its health state as "failed" with the host server. Each WebLogic Server instance, in turn, checks the health state of its registered subsystems to determine its overall viability. In combination with Node Manager, server self-health monitoring enables you to automatically restart servers that have failed.
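The registration-and-aggregation pattern described above can be sketched in Python. The `Health` and `ServerInstance` names here are illustrative only, not WebLogic's actual API:

```python
from enum import Enum

class Health(Enum):
    OK = "ok"
    WARN = "warn"
    FAILED = "failed"

class ServerInstance:
    """Aggregates the self-reported health of registered subsystems."""

    def __init__(self):
        self._subsystems = {}

    def register(self, name, state=Health.OK):
        # A subsystem registers itself with the host server.
        self._subsystems[name] = state

    def report(self, name, state):
        # A subsystem updates its own health state.
        self._subsystems[name] = state

    def overall_health(self):
        # The server is only as viable as its worst subsystem.
        states = self._subsystems.values()
        if Health.FAILED in states:
            return Health.FAILED
        if Health.WARN in states:
            return Health.WARN
        return Health.OK
```

For example, once a subsystem such as JMS reports `Health.FAILED`, the instance's overall health becomes `FAILED`, which is the signal a supervisor like Node Manager would act on.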

This improves the overall reliability of a domain and requires no direct intervention from an administrator.

A clustered server that is configured to be migratable can be moved in its entirety from one machine to another, either at the command of an administrator or automatically in the event of failure. The migration process makes all of the services running on the server instance available on a different machine, but not the state information for the singleton services that were running at the time of failure.

WebLogic Server supports migration of an individual singleton service as well as the server-level migration capability described above. Singleton services are services that run in a cluster but must run on only a single instance at any given time, such as JMS and the JTA transaction recovery system.

An administrator can migrate a JMS server or the JTA Transaction Recovery Service from one server instance to another in a cluster, either in response to a server failure or as part of regularly scheduled maintenance. This capability improves the availability of pinned services in a cluster, because those services can be quickly restarted on a redundant server should the host server fail.

Managed Servers maintain a local copy of the domain configuration. When a Managed Server starts, it contacts its Administration Server to retrieve any changes to the domain configuration that were made since the Managed Server was last shut down.

If a Managed Server cannot connect to the Administration Server during startup, it can start using its locally cached configuration information, that is, the configuration that was current at the time of the Managed Server's most recent shutdown. This is known as Managed Server Independence (MSI) mode.
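A minimal sketch of this fallback logic, assuming a hypothetical HTTP endpoint that serves the configuration as JSON (WebLogic's actual startup protocol differs):

```python
import json
import urllib.error
import urllib.request

def load_config(admin_url, cache_path):
    """Fetch the current configuration from the Administration Server,
    refreshing the local cache; if the server is unreachable, fall back
    to the cached copy (the configuration current at the last shutdown)."""
    try:
        with urllib.request.urlopen(admin_url, timeout=5) as resp:
            config = json.load(resp)
        with open(cache_path, "w") as f:
            json.dump(config, f)  # keep the local cache current
        return config, "admin"
    except (urllib.error.URLError, OSError):
        # Administration Server unreachable: MSI-style fallback.
        with open(cache_path) as f:
            return json.load(f), "cache"
```

Note that the cache is only as fresh as the last successful contact with the Administration Server, which is exactly the caveat described above.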

By default, MSI mode is enabled.

Recovery from the failure of a server instance requires access to the domain's configuration and security data. This section describes file backups that WebLogic Server performs automatically and recommended backup procedures that an administrator should perform. The WebLogic Security service stores its configuration data in the config directory.

Back up the config directory to a secure location in case a failure of the Administration Server renders the original copy unavailable. If an Administration Server fails, you can copy the backup version to a different machine and restart the Administration Server there. Each time a Managed Server starts, it contacts the Administration Server and, if there are changes to the domain configuration, updates its local copy of the domain config directory.
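A backup of the config directory can be as simple as a timestamped copy to separate storage; this sketch uses Python's standard library and assumes a conventional domain layout with a `config` subdirectory:

```python
import shutil
import time
from pathlib import Path

def backup_config(domain_dir, backup_root):
    """Copy the domain's config directory to a timestamped directory
    under `backup_root` (ideally on separate, secure storage)."""
    src = Path(domain_dir) / "config"
    dest = Path(backup_root) / f"config-{time.strftime('%Y%m%d-%H%M%S')}"
    shutil.copytree(src, dest)  # fails if `dest` already exists
    return dest
```

Because each backup gets its own timestamped directory, older copies are preserved rather than overwritten.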

So each Managed Server always has a current copy of its configuration data cached locally. If any of your security realms use the installed security providers, you should also maintain an up-to-date backup of the directory tree in which those providers store their data.

The files in this directory contain user, group, group membership, policy, and role information. WebLogic security providers cannot modify security data while the domain's Administration Server is unavailable, and you should not update the configuration of a security provider while a backup of LDAP data is in progress.

Undersized hardware can be incredibly detrimental in the long run. Every server should be sized for the data needs of the business it sustains. The more data loaded onto a server, and the more stress placed on it by multiple users and applications, the shorter its lifespan may be. If your server is too small to keep up with your data demands, it can slowly lose functionality and eventually fail.

The resulting monetary losses from server failure may be greater than the cost of purchasing the correct system, or the necessary upgrades, from the start. While a failed server is being replaced, there is no focus on the business objectives, diminishing productivity and increasing overhead costs. It is not a pleasant place to be.

So information is being backed up, but there is no confirmation as to whether the data is even on the tape. We will save recovery for another date.

Whether you have one server or many, over time they will fail. If, for example, the failed component was a fan and there were redundant fans in the server, the server will continue to operate, and the administrator can replace the failed fan soon thereafter. These component failures may not result in server failures, but they will require attention eventually.

Most x86 servers come with a standard three-year warranty. Support in years four and five may be expensive, but customers can consider self-support or third-party maintenance to reduce those costs. After five years, replacement parts may become almost impossible to obtain. Based on these expected x86 server hardware failure rates, maintenance costs, and part availability, enterprises should aim to replace their x86 servers approximately 48 to 60 months after installation.
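One way to turn the 48-to-60-month guideline into concrete dates is a small helper like the following (illustrative only; your replacement policy may differ):

```python
from datetime import date

def replacement_window(installed, earliest_months=48, latest_months=60):
    """Return (earliest, latest) replacement dates for a server installed
    on `installed`, per the 48-to-60-month guideline. Day-of-month is
    clamped to 28 to sidestep short months."""
    def add_months(d, months):
        total = d.month - 1 + months
        return date(d.year + total // 12, total % 12 + 1, min(d.day, 28))
    return (add_months(installed, earliest_months),
            add_months(installed, latest_months))
```

Feeding in an installation date of mid-January 2020, for instance, yields a window from January 2024 through January 2025.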

A firm plan should be in place to replace x86 servers no later than the end of the fifth year.

Heat is another killer. I call it a slow death because heat slowly damages the internal circuitry. It is important to have controlled, cooled airflow around your computing environment.

It varies, but there are several rules of thumb. Like a human being, a server might be in trouble when it starts running a fever. Check the CPU, chipset, and HDD temperatures, and check whether your fans are running properly. Other possible causes of high temperature include a clogged front intake, blockage of the exhaust or airflow, recent repositioning of the machine, or a dirty heat sink. Note: to figure out whether your server is running too hot, check with your vendor for baselines; many models come with documented acceptable operating temperature ranges.
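On Linux, temperatures can often be read from sysfs. This sketch assumes the common `/sys/class/thermal/thermal_zone*/temp` layout, which reports millidegrees Celsius; paths, zone names, and sensible thresholds all vary by vendor and model:

```python
from pathlib import Path

def check_temperatures(root="/sys/class/thermal", warn_celsius=80.0):
    """Read every thermal zone under `root` and flag zones at or above
    `warn_celsius`. Assumes the sysfs convention of millidegrees C."""
    readings = {}
    for zone in sorted(Path(root).glob("thermal_zone*")):
        raw = (zone / "temp").read_text().strip()
        readings[zone.name] = int(raw) / 1000.0  # millidegrees -> degrees C
    too_hot = {name: t for name, t in readings.items() if t >= warn_celsius}
    return readings, too_hot
```

Run on a schedule and compared against the vendor's documented operating range, this gives you the "fever check" described above.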

Occasional crashes in isolation are usually nothing to worry about. But a mysterious crash for no clear reason, on a server with no intensive processes running?

Cause for concern. Sudden slowness, too, is often the result of deep-seated problems that could put a server at risk of failure. For example, a process with a memory leak could eat up all of your system resources and grind the system to a halt. A simple software update might fix things in these instances, but your system may crash for other reasons.
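A crude heuristic for spotting the leak pattern described above is to watch whether a process's memory usage only ever grows across successive samples. The function and threshold below are an arbitrary illustration, not a substitute for a real profiler:

```python
def leak_suspect(samples_mb, min_growth_mb=1.0):
    """Return True if memory usage (in MB) grows monotonically across
    at least three samples and the total growth exceeds a threshold."""
    if len(samples_mb) < 3:
        return False  # too few samples to call it a trend
    deltas = [b - a for a, b in zip(samples_mb, samples_mb[1:])]
    return all(d > 0 for d in deltas) and sum(deltas) >= min_growth_mb
```

Normal workloads fluctuate up and down; a series that never dips is the suspicious signature worth investigating.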

For example, your Linux server might remount its filesystem read-only if your hard drive is acting up. Or data corruption might be causing applications to fail at random. Strange noises from HDDs are also a warning sign, much like the hypothetical noisy furnace we mentioned earlier.
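The read-only remount symptom can be detected by parsing `/proc/mounts`. This sketch works on the text format rather than the live file, so it is easy to test; in practice you would pass it the contents of `/proc/mounts`:

```python
def readonly_mounts(mounts_text):
    """Parse /proc/mounts-style text and return mount points whose
    options include 'ro' -- a symptom of Linux remounting a failing
    disk read-only."""
    ro = []
    for line in mounts_text.splitlines():
        parts = line.split()
        # Fields: device, mount point, fstype, comma-separated options, ...
        if len(parts) >= 4 and "ro" in parts[3].split(","):
            ro.append(parts[1])
    return ro
```

A root filesystem that was mounted `rw` at boot but now shows `ro` here is a strong hint the kernel hit disk errors.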

Want to catch problems before your users do? Look no further than Spiceworks: a network monitoring solution that gives you the power to keep tabs on servers within minutes of setup, with a real-time interface that lets you stay on top of servers, devices, and the health of your network.


