Does a disaster-proof language exist?

When creating system services that should be highly reliable, I end up writing a lot of "fault-tolerant" mechanisms for things like: messages that get lost (for example, communication with the database), what happens if the power goes out and the service restarts... how to pick up the pieces and continue correctly (bearing in mind that while picking up the pieces the power could go out again...), etc., etc.

I can imagine that for not-too-complex systems, a language that could take care of this would be very practical. That is, a language that remembers its state at all times, no matter whether the power goes down, and continues where it left off.

Does such a thing exist? If so, where can I find it? If not, why is it impossible to implement? It seems to me it would be very useful for critical systems.

PS: In the case of a lost database connection, that would mean something is wrong and human intervention is needed. The moment the connection is restored, it should continue where it left off.

EDIT: Since the discussion seems to have died down, let me add a few points (while waiting until I can add a bounty to the question).

The Erlang answer is apparently the most popular right now. I know Erlang and have read Armstrong's (the main creator's) Pragmatic book. It is all very nice (although functional languages make my head spin with all the recursion), but the "fault-tolerant" bit does not come automatically. Not at all. Erlang offers supervisors and other techniques to monitor a process and restart it if necessary. However, to write something that actually works correctly with these frameworks, you have to be a pretty advanced Erlang guru and shape your software to fit them. On top of that, if the power drops, the programmer still has to pick up the pieces and try to restore state when the program restarts.

What I'm looking for is much simpler:

Imagine a language (as simple as PHP) in which you can do things like database queries, act on the results, perform file manipulations, folder manipulations, etc.

The main feature, however, would be: if the power dies and the thing restarts, it picks up where it left off (so it remembers not only where it was, but also the state of its variables). Also, if it stopped in the middle of a file operation, it resumes that correctly too. Etc.

Last but not least, if the database connection drops and cannot be restored, the language simply pauses and signals (possibly via syslog) for human intervention, and then continues where it stopped.

A language like this would make programming such services a lot simpler.

EDIT: Judging by all the comments and answers, such a system does not exist, and probably won't in the foreseeable future, because it is (next to) impossible to get right.

Too bad... I am not asking this language (or framework) to fly me to the moon or to monitor someone's heart. But for small periodic services/tasks, which always end up loaded with code handling the border cases (power failing somewhere in the middle, connections dropping and not coming back)... where being able to pause... fix the problem... and continue where it left off would work well.

(Or a checkpoint approach, as one of the commenters pointed out (as in a video game): set a checkpoint... and if the program dies, restart from there next time.)

Bounty awarded: At the last possible minute, when everyone had concluded it could not be done, Stephen C came up with Napier88, which seems to have the attributes I was looking for. Although it is an experimental language, it proves the thing can be done and is something worth exploring further.

I will look into building my own framework (with persistent state and snapshots) to add the features I'm looking for on .Net or another virtual machine; a rough sketch of the idea follows below.
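For the record, here is a minimal, hypothetical Python sketch of that framework (the state file name, the step functions and the retry interval are all invented for illustration): each step commits a snapshot of its variables before moving on, so a power cut resumes at the last committed step, and a lost database connection pauses the run and signals syslog instead of aborting.

import json
import os
import syslog
import time

STATE_FILE = "service_state.json"   # hypothetical path for the snapshot

def load_state():
    """Return the last committed snapshot, or a fresh one on first run."""
    if os.path.exists(STATE_FILE):
        with open(STATE_FILE) as f:
            return json.load(f)
    return {"next_step": 0, "vars": {}}

def commit_state(state):
    """Atomically persist the snapshot: write a temp file, fsync, then rename."""
    tmp = STATE_FILE + ".tmp"
    with open(tmp, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp, STATE_FILE)      # atomic: either the old snapshot or the new, never half

def run(steps):
    state = load_state()
    while state["next_step"] < len(steps):
        step = steps[state["next_step"]]
        try:
            step(state["vars"])      # each step is assumed idempotent/atomic
        except ConnectionError:
            # e.g. database unreachable: signal a human, wait, then retry the same step
            syslog.syslog(syslog.LOG_ERR,
                          "step %d failed, waiting for intervention" % state["next_step"])
            time.sleep(60)
            continue
        state["next_step"] += 1
        commit_state(state)          # power may die right here; we resume at next_step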

Thank you all for your contribution and great ideas.

+41
programming-languages language-design
Sep 10 '09 at 7:53
28 answers

There is an experimental language called Napier88 which (in theory) has some disaster-proof characteristics. The language supports Orthogonal Persistence, and in some implementations this extends to include the state of the entire computation. Specifically, when the Napier88 runtime system checkpoints a running application to its persistent store, the current thread state is included in the checkpoint. If the application then crashes and you restart it the right way, you can resume the computation from the checkpoint.

Unfortunately, there are a number of hard problems that would need to be solved before the technology is ready for mainstream use. These include figuring out how to support multi-threading in the context of orthogonal persistence, how to allow multiple processes to share a persistent store, and scalable garbage collection of the persistent store.

And there is the problem of getting orthogonal persistence into a mainstream language. There have been attempts at doing OP in Java, including one by people associated with Sun (the PJama project), but there is nothing active at the moment. The JDO / Hibernate approaches are more popular these days.
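Not Napier88, but as a rough analogy of what orthogonal persistence feels like to the programmer, here is a hedged Python sketch using the standard library's shelve module (the store name and the work items are invented): variables kept in the shelf survive a crash and restart without explicit save/load code in the business logic. What Napier88 adds on top of this is capturing the thread state itself, so you would not even need the explicit loop-and-sync structure.

import shelve

# Open (or create) a persistent namespace; writeback=True lets stored
# objects be mutated as if they were ordinary in-memory variables.
with shelve.open("persistent_store", writeback=True) as db:
    db.setdefault("processed", 0)               # survives restarts
    db.setdefault("pending", list(range(100)))  # hypothetical work queue

    while db["pending"]:
        item = db["pending"][0]
        print("processing", item)               # real work would go here
        db["processed"] += 1
        del db["pending"][0]
        db.sync()                                # checkpoint after each item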




I should point out that Orthogonal Persistence is not really disaster-proof in the large sense. For example, it cannot deal with:

  • reconnecting, etc., to "external" systems after a restart,
  • application bugs that corrupt the persisted data, or
  • loss of data caused by something taking out the system between checkpoints.

For those, I do not believe there are general solutions that would be practical.

+11
Sep 18 '09 at 23:55
+59
Sep 10 '09 at 8:05

Software Transactional Memory (STM), combined with non-volatile RAM, would probably satisfy the revised question.

STM is a technique for implementing "transactions", i.e., sets of actions that are carried out effectively as one atomic operation, or not at all. Typically the purpose of STM is to let highly parallel programs interact over shared resources in a way that is easier to reason about than traditional lock-this-resource programming, and arguably with lower overhead thanks to its highly optimistic, lock-free programming style.

The basic idea is simple: all reads and writes inside a "transaction" block are recorded (somehow!); if two threads conflict at the end of either of their transactions (read/write or write/write conflicts), one is chosen as the winner and proceeds, and the other is forced to roll its state back to the beginning of the transaction and re-execute.

If one insisted that all computations be transactions, and that the state at the beginning (/end) of each transaction be stored in non-volatile RAM (NVRAM), a power failure could be treated as a transaction failure causing a "rollback", and computation would only ever proceed from committed transactional states, in a reliable way. NVRAM these days can be implemented with flash memory or a battery backup. You might need a lot of NVRAM, since programs have a lot of state (see the minicomputer story at the end). Alternatively, committed state changes could be written to log files that are flushed to disk; this is the standard method used by most databases and by reliable file systems.
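As a loose illustration of the all-or-nothing-plus-durable-commit idea (this is not real STM — there is no conflict detection between threads, and the log file name is invented): the transaction body below works on a copy of the state, and only a successful result is appended and fsync'd to an append-only log that a restart can replay.

import copy
import json
import os

LOG = "commit.log"   # hypothetical append-only commit log

def recover():
    """Rebuild state by replaying every committed transaction."""
    state = {}
    if os.path.exists(LOG):
        with open(LOG) as f:
            for line in f:
                state.update(json.loads(line))
    return state

def transact(state, body):
    """Run body on a copy; commit durably on success, discard on failure."""
    trial = copy.deepcopy(state)
    body(trial)                       # may raise -> nothing gets committed
    with open(LOG, "a") as f:
        f.write(json.dumps(trial) + "\n")
        f.flush()
        os.fsync(f.fileno())          # once this returns, the change survives power loss
    return trial

state = recover()
state = transact(state, lambda s: s.update(balance=s.get("balance", 0) + 10))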

The open question with STM right now is: how expensive is it to track the potential transaction conflicts? If implementing STM slows the machine down by an appreciable amount, people will live with the existing, somewhat unreliable, schemes rather than give up that performance. So far the story is not great, but then the research is still early.

People have not really designed languages around STM; for research purposes they have mostly enhanced Java with STM (see the ACM article from June of this year). I hear MS has an experimental version of C#. Intel has an experimental version for C and C++. The Wikipedia page has a long list. And the functional programming folks are, as usual, arguing that the absence of side effects in functional programs makes STM relatively trivial to implement in functional languages.

If I remember correctly, there was considerable early work on distributed operating systems back in the 70s, in which processes (code + state) could migrate trivially from machine to machine. I believe several of those systems explicitly allowed a node to fail and could restart a process on another node from saved state. An early key work was Dave Farber's Distributed Computing System. Since designing languages was popular in the 70s, I recall DCS had its own programming language, but I don't remember the name. If DCS didn't handle node failure and restart, I'm fairly sure the follow-on research systems did.

EDIT: A 1996 system that at first glance appears to have the desired properties is documented here. Its concept of atomic transactions is consistent with the STM ideas. (Which goes to show there is nothing new under the sun.)

Note: Back in the 70s, core memory was still king. Core, being magnetic, was non-volatile across a power failure, and many minicomputers (and, I'm sure, the mainframes) had power-fail interrupts that notified the software a few milliseconds before power was lost. Using that, it was easy to save the machine register state and shut everything down cleanly. When power was restored, control would return to a power-up restore point and the software could carry on. So many programs could survive nasty power blinks and restart reliably. I personally built a time-sharing system on a Data General Nova minicomputer; you could literally have it running 16 teletypes full blast, take a power hit, and come back and restart all the teletypes as if nothing had happened. The change from cacophony to silence and back was startling; I know, because I had to repeat it many times to debug the power-fail handling code, and of course it made a great demo (yank the plug, dead silence, plug it back in...). The language it was done in was, of course, assembler :-}

+33
Sep 13 '09 at 9:44

From what I know¹, Ada is often used in safety- and mission-critical systems.

Ada was originally targeted at embedded and real-time systems.

Notable features of Ada include: strong typing, modularity mechanisms (packages), run-time checking, parallel processing (tasks), exception handling, and generics. Ada 95 added support for object-oriented programming, including dynamic dispatch.

Ada supports run-time checks to protect against access to unallocated memory, buffer overflow errors, off-by-one errors, array access errors, and other detectable bugs. These checks can be disabled in the interest of runtime efficiency, but can often be compiled efficiently. It also includes facilities to help program verification.

For these reasons, Ada is widely used in critical systems, where any anomaly might lead to very serious consequences, i.e., accidental death or injury. Examples of systems where Ada is used include avionics, weapon systems (including thermonuclear weapons) and spacecraft.

N-version programming may also give you useful background reading.

¹ Which is basically one acquaintance who writes safety-critical embedded software

+13
Sep 10 '09 at 8:17

I doubt that the language feature you are describing is achievable.

And the reason is that it would be very hard to define general, generic failure modes and how to recover from them. Think for a second about your sample application — some website with some logic and database access. And let's say we have a language that can detect a power outage and the subsequent restart, and can somehow recover from it. The problem is that it is impossible for the language to know how to recover.

Say your application is an online blogging application. In that case it might be good enough to just continue from the point where we failed, and everything will be fine. However, consider a similar scenario for an online bank. Suddenly it is no longer smart to just continue from the same point. If, for example, I was trying to withdraw money from my account, and the computer died right after the verification but before it completed the withdrawal, and then came back a week later, it would hand me the money even though my account balance is by now negative.

In other words, there is no single correct recovery strategy, so it is not something that can be implemented in the language. What a language can do is tell you when something goes wrong, but most languages already support that with exception handling mechanisms. The rest is up to the application designers.

There are plenty of technologies for building fault-tolerant applications: database transactions, durable message queues, clustering, hardware hot-swapping, and so on and so forth. But it all depends on the specific requirements and on how much the end user is willing to pay for it all.

+13
Sep 13 '09 at 11:48

Most of these efforts, which go by the name of "fault tolerance", live in hardware, not software.

The extreme example of this is Tandem, whose NonStop machines are fully redundant.

Implementing fault tolerance in hardware is attractive because the software stack is usually assembled from components sourced from different vendors — your highly available software application might be installed alongside some decidedly shaky other applications and services, on top of an operating system that is rough-and-ready, using hardware device drivers that are notoriously fragile.

At the language level, almost all languages offer facilities for rigorous error checking. However, even with RAII, exceptions, constraints and transactions, these code paths are rarely correct, rarely tested together across multiple failure scenarios, and it is usually the error-handling code where bugs hide. So it is more about programmers' understanding, discipline and trade-offs than about the languages themselves.

Which brings us back to hardware fault tolerance. If you can avoid your database link ever failing, you can avoid writing awkward error-handling code in the applications.

+10
Sep 10 '09 at 8:13

No, a disaster-proof language does not exist.

Edit:

Disaster-proof means perfection. It conjures up images of a process that applies some intelligence to resolve unknown, unspecified and unexpected conditions in a logical manner. There is no way a programming language can do that. If you, as the programmer, cannot work out how your program can fail and how to recover from it, then your program will not be able to do it either.

Disaster, in IT terms, can arise in so many different ways that no single process could resolve all of those different problems. The idea that you could design a language to address all the ways things might go wrong is simply wrong. Because of the abstraction away from the hardware, many problems do not even make sense for a programming language to handle; yet they are still "disasters".

Of course, once you start limiting the scope of the problem, then we can start talking about designing a solution for it. So when we stop talking about being disaster-proof and start talking about recovering from unexpected power failures, it becomes much easier to design a programming language to address that concern — even if it may not make much sense to handle the problem that high up the stack. However, I would venture to predict that once you narrow it down to realistic implementations, it becomes uninteresting as a language, because it has become so specific: e.g., "use my scripting language for overnight batch processes that recover from unexpected power failures and lost network connections (with some help from a human)" — not a compelling pitch, to my mind.

Please don't get me wrong. There are some wonderful suggestions in this thread, but to my mind none of them amount to anything even remotely approaching disaster-proof.

+10
Sep 14 '09 at 14:56

Consider a system built from non-volatile memory. The program state is persisted at all times, and if the processor stops for whatever length of time, it resumes at the point it had reached when it is restarted. In that sense your program is "disaster-proof" to the extent that it can survive a power failure.

This is entirely possible, as other answers have pointed out when talking about Software Transactional Memory, "fault tolerance" and so on. Curiously, no one has mentioned "memristors", which have been proposed as a future architecture with exactly these properties, possibly doing away with the von Neumann architecture entirely.

Now imagine a system built from two such discrete systems — for a simple illustration, one is a database server and the other is an application server for an online banking website.

If one of them pauses, what does the other do? How does it cope with the sudden unavailability of its colleague?

It could be handled at the language level, but that would mean a lot of error handling and the like, and that code is tricky to get right. That is hardly any better than today, where the machines themselves don't survive failures, but the languages detect problems and ask the programmer to deal with them.

The two could also pause together — at the hardware level they could be tied to each other so that, power-wise, they are one system. But that is hardly a good idea; higher availability comes from a fail-over architecture with backup systems, and so on.

Or we could use persistent message queues between the two machines. However, at some point those messages get processed, and at that moment they might be too old! Only the application logic can really work out what to do in those circumstances, and we are back to the language delegating to the programmer again.

So it seems that failure-proofing in its current form is about as good as it gets — uninterruptible power supplies, hot standby servers, multiple network routes between hosts, etc. And then we can only hope that our software is bug-free!

+4
Sep 15 '09 at 11:14

The precise answer:

Ada and SPARK were designed for maximum fault tolerance, and to move every error they can from runtime to compile time. Ada was designed by the U.S. Department of Defense for military and aviation systems, running on embedded devices in things like airplanes. SPARK is its descendant. There was another language used in the early U.S. space program, HAL/S, geared towards handling hardware failures and memory corruption due to cosmic rays.




The practical answer:

I have never met anyone who could actually write Ada/SPARK. For most users, the best answer is the SQL variants on a DBMS with automatic failover and clustering of servers. Integrity checks guarantee safety. Something like T-SQL or PL/SQL has full transactional safety, is Turing-complete, and is pretty tolerant of problems.




The reason there isn't a better answer:

For performance reasons, you cannot provide durability for every single program operation. If you did, processing would slow down to the speed of your fastest non-volatile storage. At best your performance would drop by a factor of a thousand or a million, because of how much slower ANYTHING is compared to the CPU or the RAM caches.

It would be the equivalent of going from a Core 2 Duo CPU back to an ancient 8086 — at best you could do a couple of hundred operations per second. Except it would be even SLOWER than that.

In cases where frequent power cycling or hardware failures are expected, you use something like a DBMS, which guarantees ACID for every important operation. Or you use hardware with fast non-volatile storage (like flash) — still a lot slower, but fine if the processing is simple.
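As a hedged illustration of leaning on the DBMS for durability (the table and amounts are made up), here is the earlier bank-withdrawal example written against Python's built-in sqlite3 module: using the connection as a context manager makes the whole withdrawal commit or roll back as one unit, so a power cut mid-operation never leaves a half-done transfer.

import sqlite3

conn = sqlite3.connect("bank.db")   # hypothetical database file
conn.execute("CREATE TABLE IF NOT EXISTS accounts (id TEXT PRIMARY KEY, balance INTEGER)")
conn.execute("INSERT OR IGNORE INTO accounts VALUES ('alice', 100)")
conn.commit()

def withdraw(account, amount):
    # 'with conn' wraps the block in a transaction: commit on success,
    # rollback on any exception. A crash before commit loses nothing.
    with conn:
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE id = ?", (account,)).fetchone()
        if balance < amount:
            raise ValueError("insufficient funds")
        conn.execute("UPDATE accounts SET balance = balance - ? WHERE id = ?",
                     (amount, account))

withdraw("alice", 30)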

At best, your language gives you good compile-time checks for errors and throws exceptions instead of crashing. Exception handling is a feature of half the languages in use nowadays.

+3
Sep 15 '09 at 17:19

This question prompted me to post this text.

(It is quoted from Douglas Adams' HGTTG:)




Click, hum.

The great grey Grebulon reconnaissance ship moved silently through the black void. It was travelling at fabulous, breathtaking speed, yet appeared, against the glimmering billions of distant stars, to be moving not at all. It was just one dark speck frozen against the infinite granularity of the brilliant night.

On board the ship, everything was as it had been for millennia: deeply dark and silent.

Click, hum.

At least almost everything.

Click, click, hum.

Click, hum, click, hum, click, hum.

Click, click, click, click, click, click.

Hmmm.

The low-level supervising program woke up the slightly higher-level supervising program deep in the ship's dormant cyberbrain and reported to it that whenever it went click, all it got was a hum.

The higher-level supervising program asked it what it was supposed to get, and the low-level supervising program said that it couldn't remember exactly, but thought it was probably more of a sort of distant, satisfied sigh, wasn't it? It didn't know what this hum was. Click, hum, click, hum. That was all it was getting.

The higher-level supervising program considered this and didn't like it. It asked the low-level supervising program what exactly it was supervising, and the low-level supervising program said it couldn't remember that either — just that it was something that was meant to go click, sigh every ten years or so, which usually happened without fail. It had tried to consult its error look-up table but couldn't find it, which was why it had alerted the higher-level supervising program to the problem.

The higher-level supervising program went to consult one of its own look-up tables to find out what the low-level supervising program was meant to be supervising.

It couldn't find the look-up table.

Odd.

It looked again. All it got was an error message. It tried to look up the error message in its error message look-up table and couldn't find that either. It allowed a couple of nanoseconds to go by while it went through all this again, and then it woke up its sector function supervisor.

The sector function supervisor hit immediate problems. It called its supervising agent, which hit problems too. Within a few millionths of a second, virtual circuits that had lain dormant, some for years, some for centuries, were flaring into life throughout the ship. Something, somewhere, had gone terribly wrong, but none of the supervising programs could tell what it was. At every level, vital instructions were missing, and the instructions about what to do in the event of discovering that vital instructions were missing, were also missing.

Small modules of software — agents — surged through the logical pathways, grouping, consulting, re-grouping. They quickly established that the ship's memory, all the way back to its central mission module, was in tatters. No amount of interrogation could determine what had happened. Even the central mission module itself seemed to be damaged.

This made the whole problem very simple to deal with: replace the central mission module. There was another one, a backup, an exact duplicate of the original. It had to be physically replaced because, for safety reasons, there was no link whatsoever between the original and its backup. Once the central mission module was replaced, it could itself supervise the reconstruction of the rest of the system in every detail, and all would be well.

Robots were instructed to bring the backup central mission module from the shielded strongroom where they guarded it, to the ship's logic chamber for installation.

This involved a lengthy exchange of emergency codes and protocols as the robots interrogated the agents as to the authenticity of the instructions. At last the robots were satisfied that all the procedures were correct. They unpacked the backup central mission module from its storage housing, carried it out of the storage chamber, fell out of the ship and tumbled into the void.

This provided the first major clue as to what it was that was wrong.

Further investigation quickly established what had happened. A meteorite had knocked a large hole in the ship. The ship had not previously detected this because the meteorite had neatly knocked out that part of the ship's processing equipment which was supposed to detect whether the ship had been hit by a meteorite.

The first thing to do was to try to seal up the hole. This turned out to be impossible, because the ship's sensors couldn't see that there was a hole, and the supervisors, which should have said that the sensors weren't working properly, weren't working properly and kept saying that the sensors were fine. The ship could only deduce the existence of the hole from the fact that the robots had clearly fallen out of it, taking its spare brain — which would have enabled it to see the hole — with them.

The ship tried to think intelligently about this, failed, and then blanked out completely for a bit. It didn't realize it had blanked out, of course, because it had blanked out. It was merely surprised to see the stars jump. After the stars had jumped a third time, the ship finally realized that it must be blanking out, and that it was time to take some serious decisions.

It relaxed.

Then it realized it hadn't actually taken the serious decisions yet, and panicked. It blanked out again for a bit. When it came round again, it sealed all the bulkheads around where it knew the unseen hole must be.

It clearly hadn't reached its destination yet, it thought, but since it no longer had the faintest idea where its destination was or how to reach it, there seemed to be little point in continuing. It consulted what tiny scraps of instructions it could reconstruct from the tatters of its central mission module.

"Yours !!!!!!!!!!!!!!!! year mission - !!!!!!!!!!!!!!!!!!!!!!!!!!!!! !!!!!!!!!!!!, land !!!!!!!!!!!!!!!! safe distance !!!!!!!!!!! .......... ........., ground ............... monitor. !!!!!!!!!!!!!!!! ... "

The rest was complete garbage.

Before it blanked out for good, the ship would have to pass on those instructions, such as they were, to its more primitive subsidiary systems.

It must also revive all of its crew.

There was another problem. While the crew was in hibernation, the minds of all of its members — their memories, their identities and their understanding of what they had come to do — had been transferred into the ship's central mission module for safe keeping. The crew would not have the faintest idea of who they were or what they were doing there. Oh well.

Just before it blanked out for the final time, the ship realized that its engines were beginning to give out too.

The ship and its revived and confused crew went on under the control of its subsidiary automatic systems, which simply looked to land wherever they could find to land, and to monitor whatever they could find to monitor.

+3
Sep 18 '09 at 19:41

Try taking an existing open-source language and seeing whether you can adapt its implementation to include some of these features. Python's default implementation includes a global lock (the GIL, Global Interpreter Lock) which is used to "handle" concurrency among Python threads by switching between them every n VM instructions. Perhaps you could hook into that same mechanism to checkpoint the state of the code.
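A crude sketch of that hook idea, with the caveats that this is not the GIL's check interval itself and that it only snapshots an explicitly registered dictionary, not the full interpreter state (the file name and interval are invented): Python's sys.settrace lets you run a callback on call/line/return events, and every n events you can dump a checkpoint.

import pickle
import sys

CHECKPOINT_EVERY = 1000          # "every n events", loosely mirroring the interpreter's check interval
_counter = 0
shared_state = {"progress": 0}   # only this registered state is snapshotted

def tracer(frame, event, arg):
    global _counter
    _counter += 1
    if _counter % CHECKPOINT_EVERY == 0:
        with open("checkpoint.pkl", "wb") as f:
            pickle.dump(shared_state, f)   # crude periodic checkpoint
    return tracer                          # keep receiving line events for this frame

sys.settrace(tracer)

def work():
    for i in range(100000):
        shared_state["progress"] = i

work()
sys.settrace(None)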

+2
Sep 10 '09 at 8:10

For a program to pick up where it was when the machine lost power, not only would it have to save its state somewhere, the OS would also have to "know" to resume it.

I suppose a "hibernate" feature could be implemented in a language, but having it happen all the time in the background, so it's ready whenever something bad happens, sounds like the OS's job to me.

+2
Sep 12 '09 at 14:57

The main feature, however, would be: if the power dies and the thing restarts, it picks up where it left off (so it remembers not only where it was, but also the state of its variables). Also, if it stopped in the middle of a file operation, it resumes that correctly too. Etc.

......

I have looked at Erlang in the past. However fault-tolerant it is, it does not survive a power cut. When the code restarts, you still have to pick up the pieces.

If such a technology existed, I would be very interested to read about it. However, the Erlang solution would be to have several nodes — ideally in different locations — so that if one location goes down, the other nodes can take up the slack. If all your nodes were in the same place and on the same power source (not a good idea for a distributed system), then you are out of luck, as you mentioned in the comments.

+2
Sep 12 '09 at 15:08

The Microsoft Robotics team has released a set of libraries that may be relevant to your question.

What is the Concurrency and Coordination Runtime (CCR)?

The Concurrency and Coordination Runtime (CCR) provides a highly concurrent programming model based on message passing, with powerful orchestration primitives that enable coordination of data and work without the use of manual threading, locks, semaphores, etc. The CCR addresses the needs of multi-core and concurrent applications by providing a programming model that facilitates managing asynchronous operations, dealing with concurrency, exploiting parallel hardware and handling partial failure.

What is Decentralized Software Services (DSS)?

Decentralized Software Services (DSS) provides a lightweight, state-oriented service model that combines representational state transfer (REST) with a formalized composition and event-notification architecture, enabling a systematic approach to building applications. In DSS, services are exposed as resources which are accessible both programmatically and for UI manipulation. By integrating service composition, structured state manipulation and event notification with data isolation, DSS provides a uniform model for writing highly observable, loosely coupled applications running on a single node or across the network.

Most of the answers given are about general-purpose languages. You may want to look at more specialized languages used in embedded devices. A robot is a good example to think about: what would you want and/or expect a robot to do when it recovers from a power failure?

+2
Sep 13 '09 at 4:21

In the embedded world, this can be achieved with a watchdog timer interrupt and battery-backed RAM. I have written such things myself.

+2
Sep 13 '09 at 4:34

Depending on your definition of a disaster, it can range from "quite hard" to "practically impossible" to delegate this responsibility to the language.

Others have given examples of saving the application's current state to NVRAM after each statement executes. That only works as long as the computer isn't destroyed.

How would a language-level feature know to restart the application on a new host?

And in the situation of restoring the application on a new host — what if considerable time has passed, and assumptions/checks made previously are now invalid?

T-SQL, PL/SQL and other transactional languages are probably as close to "disaster-proof" as you will get — they either succeed (and the data is committed) or they don't. Short of turning off transactional isolation, it is difficult (although probably not impossible if you try really hard) to get into "unknown" states.

You can use techniques such as SQL mirroring to ensure that writes are stored in at least two places at once before the transaction completes.

You still have to ensure that you save your state every time it is safe to do so (commit).

+2
Sep 15 '09 at 11:38

If I understand your question correctly, I think you are asking whether it is possible to guarantee that a particular algorithm (that is, a program plus any recovery facilities provided by the environment) will complete, after any arbitrary number of recoveries/restarts.

If that is correct, then I would point you at the halting problem:

Given a description of a program and a finite input, decide whether the program finishes running or will run forever, given that input.

I think it is fair to classify your question as an instance of the halting problem, given that you would ideally like the language to be "disaster-proof" — that is, to confer "perfection" on any imperfect program running in any chaotic environment.

This classification reduces any combination of environment, language and program to "a program and a finite input".

If you agree with that, then you will probably be disappointed to hear that the halting problem is undecidable. Therefore, no "disaster-proof" language, compiler or environment could ever be proven to be so.

However, it is entirely reasonable to design a language that provides recovery facilities for various common failure scenarios.

+2
Sep 15 '09 at 20:13

In the case of a power failure... this sounds to me like: "When your only tool is a hammer, every problem looks like a nail."

You don't solve the power-failure problem inside the program. You solve it with redundant power supplies, batteries, and so on.

+2
Sep 17 '09 at 14:57

There are several commercially available frameworks — Veritas, Sun HA, IBM HACMP, etc., etc. — which automatically monitor processes and start them on another server if they fail.

There is also expensive hardware, such as HP's Tandem NonStop range, which can survive internal hardware failures.

However, software is built by humans, and humans love to make mistakes. Consider the cautionary tale of the IEFBR14 program shipped with IBM's MVS. It is basically a do-nothing dummy program that lets you exercise a bit of JCL without actually running a program. This is the entire source code:

IEFBR14 START
        BR    14      Return addr in R14 -- branch at it
        END

Nothing could be simpler, right? Yet over its long life this program has actually had a bug report raised against it, and it is now on version 4.

That's one bug to three lines of code, and the current version is four times the size of the original.

Bugs will always creep in — just make sure you can recover from them.

+2
Sep 18 '09 at 8:54

If the failure mode you care about is limited to hardware failures, VMware Fault Tolerance claims to do what you want. It runs a pair of virtual machines across a cluster and, using what they call vLockstep, the primary VM streams all of its state to the secondary VM in real time, so that in the event of a primary failure, execution transparently fails over to the secondary.

My guess is that this would not help with communication failures, which are more common than hardware failures. For serious high availability, you should look at distributed systems such as Ken Birman's process-group approach (PDF paper, or the book Reliable Distributed Systems: Technologies, Web Services, and Applications).

+2
Sep 18 '09 at 23:39

The closest approximation would be SQL, I think. But it isn't really a language issue; it is mostly a VM issue. I could imagine a Java VM with these properties; implementing it would be another matter.

A quick and rough approximation is achieved by application checkpointing. You lose the "die at any moment" property, but it is pretty close.

+1
Sep 10 '09 at 8:03

I think it is a fundamental mistake for recovery not to be a significant design concern of the program itself. Pushing responsibility for invulnerability onto the environment generally leads to a fragile solution, intolerant of internal faults.

If it were me, I would invest in reliable hardware and design the software so that it can recover automatically from any possible state. Per your example, maintaining the database session should be handled automatically by a reasonably high-level API. If you have to reconnect manually, you are probably using the wrong API.

As others have pointed out, the programming languages embedded in modern RDBMSs are the closest you will get without resorting to an exotic language.

Virtual machines in general are designed for this kind of thing. You could use a VM vendor's client APIs (VMware, et al.) to schedule periodic checkpoints of your application, if necessary.

VMware in particular has a replay feature (Enhanced Execution Record/Replay) which records EVERYTHING and allows replay from a point in time. Obviously there is a huge performance hit with this approach, but it would meet the requirements. I would just make sure your disk drives have a battery-backed write cache.

You would most likely find similar facilities for running Java bytecode inside a Java virtual machine. Google "fault-tolerant JVM" and "virtual machine checkpointing".

0
Sep 15 '09 at 18:05

If you want to save the state of a program, where would you save it?

It would have to be saved to, for example, disk. But that would not help you if the disk failed, so it is already not disaster-proof.

You are only ever going to get a certain granularity in your saved state. If you want something like this, then probably the best approach is to decide on that granularity, in terms of what constitutes an atomic operation, and save the state to a database before each atomic operation. You can then restore to the point of the last completed atomic operation.

I don't know of any language that would do this automatically, since the cost of persisting state to secondary storage is extremely high, so there is a trade-off between granularity and efficiency that would be hard to settle for an arbitrary application.

0
Sep 15 '09 at 18:22
  • First, build a fault-tolerant application. One where, if you have 8 functions and 5 failure modes, you have done the analysis and testing to demonstrate that all 40 combinations work as intended (and work as your particular customer wants — any two customers will most likely disagree).
  • Second, add a scripting language on top of that supported set of fault-tolerant functions. It should be as close to stateless as possible, so almost certainly something that is not Turing-complete.
  • Finally, work out how to handle restoring and recovering the scripting language's state, tailored to each failure mode.

And yes, this is pretty much rocket science.

0
Sep 15 '09 at 20:53

Windows Workflow Foundation may solve your problem. It is .Net-based and is designed graphically as a workflow with states and actions.

It lets you persist the workflow to a database (automatically or on demand). You can do this between states/actions; it serializes the entire workflow instance to the database. The workflow will be rehydrated and execution will continue when any of several conditions is met (a specific time, being rehydrated programmatically, an event firing, etc.).

When the WF host starts up, it checks the persistence database and rehydrates any workflows stored there, which then continue executing from where they were persisted.

Even if you don't want to use the workflow aspects, you can probably still use just the persistence service.

As long as your steps are atomic, this should be enough — especially since I assume you have a UPS, so the host can listen for UPS events and force persistence if a power problem is detected.

0
Sep 18 '09 at 2:06

If I were solving your problem, I would write a daemon (probably in C) that did all of its database interactions inside transactions, so you won't end up with any bad data if it gets interrupted. Then have the system start that daemon at boot.

Obviously developing web stuff in C is a lot slower than in a scripting language, but it will perform better and be more stable (if you write good code, of course :).

Actually, no — I would write it in Ruby (or PHP or whatever), and have something like Delayed Job (or cron or any other scheduler) run it every so often, because it probably doesn't need to be up to date to the clock cycle.

Hope this makes sense.

0
Sep 18 '09 at 14:24

In my opinion, the concept of recovering from failure is, in most cases, a business problem, not a hardware or language problem.

Take an example: you have a UI tier and a subsystem. The subsystem is not very reliable, but the client at the UI tier is supposed to perceive it as if it were.

Now imagine that your subsystem somehow crashes. Do you really think the language you are imagining could work out, on its own, how the UI tier should handle its dependency on that subsystem?

Your user has to know that the subsystem is unreliable; if you use messaging to give it high reliability, the client MUST know that (if it doesn't, the UI may simply freeze waiting for a response that might finally arrive two weeks later). If it needs to know about it, that means any abstraction trying to hide it will eventually leak.

By client I mean the end user, and the UI should reflect this unreliability rather than hide it; the computer cannot do this thinking for you.

0
Sep 18 '09 at 20:12

"So, the language that this state would remember at any moment, regardless of whether the power shuts down and continues where it left off."

“Continuing where he left off” is often not the right recovery strategy. Not a single language or environment in the world will try to guess how to automatically recover from a specific error. The best he can do is provide you with the tools to write your own recovery strategy so that it does not interfere with your business logic, for example.

  • Exception handling (to fail fast and ensure state consistency)
  • Transactions (to roll back incomplete changes)
  • Workflows (to define recovery routines that are invoked automatically)
  • Logging (to pinpoint the cause of a failure)
  • AOP / dependency injection (to avoid having to manually insert code that does all of the above)

These are very general-purpose tools, and they are available in many languages and environments.
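For instance, here is a very small sketch of a hand-written recovery policy built from those ingredients (the wrapped operation is hypothetical): exceptions stop the work fast, logging records the cause, the operation itself is assumed to be transactional so a failed attempt leaves no partial changes, and only after repeated failures is a human asked to intervene.

import logging
import time

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("recovery")

def with_recovery(operation, attempts=3, delay=5):
    """Log, rely on the operation's own rollback, retry, then escalate."""
    for attempt in range(1, attempts + 1):
        try:
            return operation()     # operation is assumed to be transactional
        except Exception:
            log.exception("attempt %d/%d failed", attempt, attempts)
            time.sleep(delay)
    raise RuntimeError("manual intervention required")

# Usage (transfer_funds is a hypothetical transactional unit of work):
# with_recovery(lambda: transfer_funds("alice", "bob", 30))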

0
Sep 18 '09 at 20:52


