Прикладне програмне забезпечення © Dmytro Rahozin, 2020 ISSN 1727-4907. Проблеми програмування. 2020. № 2–3. Спеціальний випуск 149

UDC 681.3 https://doi.org/10.15407/pp2020.02-03.149

MODELS OF CONCURRENT PROGRAM RUNNING IN RESOURCE CONSTRAINED ENVIRONMENT

Dmytro Rahozin

The paper considers concurrent program modeling using resource-constrained automatons. Several software samples are considered: real-time operating systems, video processing including object recognition, neural network inference, and common methods for solving linear systems in physical process modeling. Source code annotation and automatic extraction of program resource constraints with the help of profiling software are considered; this enables modeling of concurrent software behavior with minimal user assistance.

Key words: concurrent programs, program annotation, software modeling, automatons.

Introduction

The growth of software complexity drives a vast amount of research into practical software verification, testing theory and testing techniques. Still, there is a huge gap between the abilities of formal verification techniques and the modern need for fast and inexpensive software verification; this gap is usually bypassed by increasing the effort spent on developing software test cases. This leads to a "state of the art" in which software contains a large number of errors, but the test case sets and day-by-day usage scenarios hide the untested errors, so the software looks "good enough" and ready to deploy. The motivating examples here are untouched corner cases, resource allocation problems and erroneous synchronization protocols: this mix usually runs smoothly but fails in rare cases. Due to the rise of complex Linux/FreeRTOS environments everywhere in embedded/IoT/automotive systems, a developer is able to create more complex software using the available Linux toolset, so software complexity increases exponentially. Of course, the increase in software complexity leads to a dramatic improvement in product usability but multiplies the number of untested scenarios; this process is unstoppable due to user demand for improved software usability. In this paper we consider cutting-edge software application cases which are affected by the growth of software complexity: software parallelization, concurrent software runs and thread interaction in real-time reactive systems.
Concurrent software running on modern parallel platforms is the usual consequence of efforts to minimize software complexity: a monolithic "serial" program becomes a system of multiple synchronized concurrent threads. The problem is that the set of tested thread interaction cases usually covers only well-known interaction scenarios, but in case of resource access denials, lack of resources or temporary device timeouts the software system (sometimes not a system but a mix of threads) fails. In parallelized software, the synchronization protocol between execution threads may have drawbacks (mis-synchronization, data loss), and resource access may cause deadlocks. In a real-time reactive system, the firmware consists of several concurrent threads under the control of a simplified real-time operating system without hardware memory management and without any blocking of thread access to another thread's or process's memory. In such a case even simple untested errors (for example, in pointer arithmetic) cause system failures, so the stability of the software system is still guaranteed only by a hardware watchdog timer. In this article, theoretical aspects and several practical cases are considered. These cases can improve the management of corner cases and of conditions in which complex software with concurrent threads may fall into erroneous or deadlock states due to poor management of resources (memory, time, etc.) or simply insufficient resources. Resource-constrained automatons are proposed as a model for concurrent processes, and the resource allocation for software models is extracted from the results of real software runs. Resource-constrained automaton scalability is also addressed, as wrong scaling of software models makes them impossible to use in real life.
Problem statement

Concurrent thread execution is not limited to parallelization of a single computation scenario; even more beneficial is the concurrent execution of processes to utilize available resources, for example parallel builds of source files in a project, parallel processing of web site pages or, most popular in recent years, cryptocurrency mining. The well-known Moore's "law" predicted that the computational capability of computers doubles every eighteen months. This contradicts the serial execution nature of some 95% of developed software, which gains only about a 5% speedup with each new processor generation: it is impossible to speed up a serial-fashion program, but it is possible to run many independent programs. The "silver bullet" for the computer industry appears to be the parallelization of software, but the economic side of parallelization is very (or extremely) expensive. Even worse, mathematical methods of software parallelization were studied and evaluated long ago, and current common computer architectures limit the parallelization gain through cache memory structure and size and through bottlenecks on the cache-to-main-memory bus. Many computational methods are therefore limited to internal parallelism of 2-4 concurrently working CPUs, while off-the-shelf CPUs in 2020 have 8 cores, so the computational capability of a middle-level processor exceeds the ability of usual software to employ the offered level of parallelism. High parallel computing capability may be easily employed by handling multiple processes, which is common, e.g., for modern video content processing, where the utilization of the other resources (network, memory) is highly predictable. When running different types of processes, the real parallelism is limited by available memory and disk I/O; it is a common problem that running more processes may slow down the whole system as memory or disk I/O becomes a bottleneck.
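The 95%/5% observation above is Amdahl's law in disguise; a minimal sketch with the 8-core figure from the text (the parallel fractions here are illustrative, not measurements from the paper):

```python
def amdahl_speedup(parallel_fraction, n_cores):
    """Amdahl's law: overall speedup when only part of a program parallelizes."""
    serial = 1.0 - parallel_fraction
    return 1.0 / (serial + parallel_fraction / n_cores)

# A program that is only 5% parallelizable barely benefits from 8 cores,
# while a 95%-parallel kernel still stays well below the ideal 8x.
print(round(amdahl_speedup(0.05, 8), 3))   # mostly-serial program
print(round(amdahl_speedup(0.95, 8), 3))   # well-parallelized kernel
```

Even a well-parallelized kernel saturates far below the core count, which is exactly why running many independent processes is often more productive than parallelizing one.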
The proper balance of the resource capabilities of a system is an important task for CPU clusters, as a resource-balanced system has additional benefits for the user: 1) less processing time, as no extra memory-disk-memory data offloading operations are needed; 2) improved financial efficiency, as hardware resources are employed efficiently and the electricity and cooling bill is lower than in the unbalanced case; 3) the overall software system can be scaled to "bigger" (in terms of computational power) hardware with less effort. In this paper we consider technologies, from formal models to hardware simulations, which allow annotating and characterizing the resource use of parallel (and serial) applications targeted to run on clusters or on simpler computers with many CPUs. This annotation later allows balancing the execution of concurrent processes on a given hardware platform. One of the target software packages for this kind of parallelization optimization is the weather forecast software from the National Meteorological Institute [1], which is a good candidate for parallelization and further efficient execution on a parallel computer or cluster. The goal of optimizing the forecast package is continuous use of the software in a "software-as-a-service" manner, providing heavy numerical computations with optimized resource allocation.

Software resource consumption modeling

The first question which arises here is why we need to account for resources at all if the software can run in a virtual environment on some server. The question is valid if we approach parallel software from the business side. Virtual environments such as Docker (www.docker.com) allow running an application in a sandbox, creating the illusion that all system resources are managed by the program. Resource management is passed to the supervisor side, and if several applications start to consume extra resources simultaneously, system performance degrades quickly.
(Different techniques invented to offload highly loaded computers are not discussed here.) Thus the system administrator should reserve resources for the worst-case system load. It should be noted that virtual environments are not the proper tool for parallelization; parallelization here works only for simple cases, such as running compilers over big source bases, and even in this case disk I/O is the performance-limiting factor. Virtual machine management works best for providing unchanged environments (which tend to become obsolete within half a year) for specific software, with great savings on application support, until the software system can be replaced completely by a next-generation one. The second question which arises here is how to employ resource consumption data in day-by-day tasks and how to use it to improve application performance. Returning to Moore's law, we see that the way to improve application performance is to employ more transistors on a chip, and the number of transistors on a chip has grown faster than linearly over the last 30 years (though not exponentially, due to technological and financial limitations). These transistors need to be loaded with useful workloads; otherwise there is no sense in technology improvements. But even these dramatic technology improvements still do not make some modern workloads run at an appropriate speed. Neural network inference suffers from low computational speed in all artificial intelligence (AI) tasks. Real-time video processing at HD resolution requires an enormous amount of computational capacity. Real-time industrial systems are severely limited in sensing their environment in real time. Let us consider several motivating cases.
A good modern case is video processing for streaming cameras. Each camera streams encoded video, which must be received (consuming network bandwidth), decoded (employing hardware video decoder resources, whose capacity on a chip is limited), pre-processed (computational cost grows with image resolution; for example, switching from VGA 640x480 to FullHD requires a 6-8x increase in computational resources, and a 4K image requires 4 times more resources than FullHD), processed by computer vision functions (computational requirements rise linearly with picture resolution) and post-processed. Video cards, which are widely used to process images, have fixed amounts of memory, a fixed number of execution blocks, special video processing units (decoders and encoders) and limited I/O bandwidth on the external memory bus. Each video decoding thread, video filtering block and neural-network-based image recognition block requires some defined amount of video memory (depending on the actual video stream resolution), some allocation of general-purpose and specialized computational units, and an allocation of memory bus bandwidth. In most cases the resource consumption requirements change over time. So even proper computational loading of a video card requires solving, formally speaking, quite a complex dynamic programming task. Another motivating (and modern) case is resource scheduling for real-time (or semi-real-time) embedded applications. Employing embedded Linux (Yocto – yoctoproject.org) on low-power processors has enabled complex Linux-based environments, and real-time quasi operating systems (e.g. FreeRTOS – freertos.org) run on embedded chips (with clock rates up to 1 GHz) which are able to process data streams with sampling frequencies exceeding dozens of megahertz.
For such cases, data stream generation and network traffic cannot fit into static memory allocation concepts, so applications may temporarily allocate extra resources and may run into a deadlock on a platform with a limited amount of resources. The worst scenario here is that the system may hang in a deadlock for an infinite time instead of performing a hard reset and continuing to work. A reasonable way to avoid resource deadlocks is to predict possible resource allocation scenarios for a set of concurrent tasks. The main idea behind limiting overall resource consumption is that real-life concurrent tasks are synchronized in some way, and therefore their peak resource consumption intervals do not intersect in time. For non-synchronized tasks, the system resource pool should be as large as the sum of the maximal resource consumption of each concurrent task. For synchronized tasks, their resource consumption should be analyzed in the time domain to form a roadmap of resource consumption for each task and to predict the resource consumption of the whole system. The formal apparatus for this kind of analysis was initially presented in [2] and extended for resource allocation in [3]. It should be noted that we are not considering only hard real-time systems, where the execution time of operations is strictly determined. Between synchronization points the tasks are not limited to clock-accurate instruction simulation: instruction execution performance may vary, and the time between resource allocation and deallocation may differ by up to 20% from run to run. This fact calls for probabilistic analysis of task execution and resource consumption. The model formalization may be done using timed I/O automatons [4] or a program execution modeling algebra [5]. Practical aspects of real-time software modeling are considered in [6]; that paper highlights typical software structures which can be efficiently modeled using such an algebra.
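The distinction between the two pool bounds above can be stated as a small computation; a sketch, assuming each task's resource demand is sampled on a common timeline (the task profiles below are invented for illustration):

```python
def pool_bound_unsynchronized(task_profiles):
    # With no synchronization, peaks may coincide: budget the sum of per-task maxima.
    return sum(max(p) for p in task_profiles)

def pool_bound_synchronized(task_profiles):
    # With a known common timeline, peaks that never overlap need not be summed:
    # budget the maximum of the pointwise sums over time.
    return max(map(sum, zip(*task_profiles)))

# Two tasks whose 100-unit peaks are phase-shifted by their synchronization:
a = [100, 10, 10, 10]
b = [10, 10, 100, 10]
print(pool_bound_unsynchronized([a, b]))  # worst-case budget
print(pool_bound_synchronized([a, b]))    # timeline-aware budget
```

The gap between the two bounds (here 200 versus 110 units) is exactly what time-domain analysis of synchronized tasks recovers.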
Each transition between states is enabled only if all necessary resources are allocated. So, for a given resource-constrained model, each transition has a corresponding resource requirement which must be met during concurrent execution. The program (thread) execution forms a temporal sequence of resource constraints.

Program execution models with resource automatons

The basic question for any software modeling approach (temporal logic, timed automatons, imitational simulation via mathematical apparatus, or specialized simulations such as network simulations) is the modeling scale. In most cases it is impossible to simulate a program at the basic block [7] level, but this is also unnecessary. The automaton-based model of program execution should keep the basic program properties which may potentially cause deadlocks. Let us consider several cases of common program scenarios in the previously mentioned areas.

Case 1. Parallel programs derived from serial ones by parallelizing loops and utilizing multiple processor cores, with the main goal of shortening execution time. The automaton description covers memory resources used for parallelized loops, dynamic memory allocations and deallocations, synchronization points, mean relative execution times for loop nests, and typical execution times for program parts. Good samples of parallelized program kernels are, e.g., solving systems of linear algebraic equations (the conjugate gradient method) and gas dynamics calculations. The sample automaton for the conjugate gradient method is shown in figure 1.

k = 0; x = 0; r = b; rho_0 = ||r||_2^2
while sqrt(rho_k) > eps * ||b||_2 and k < k_max
    k = k + 1
    if (k = 1)
        p = r
    else
        beta_k = rho_{k-1} / rho_{k-2}
        p = r + beta_k * p
    endif
    w = A * p
    alpha_k = rho_{k-1} / (p^T * w)
    x = x + alpha_k * p
    r = r - alpha_k * w
    rho_k = ||r||_2^2
end while

(states S0-S5 mark the annotated transitions of the automaton)

Figure 1.
An automaton for the conjugate gradient method case

The conjugate gradient algorithm, derived from [8], consists of six main states, S0-S5, in which computationally intense or memory-access-intense operations take place. A computationally intense operation means that a significant part of the computations inside the software module (0.1% - 1%) is performed during the move from this state to the next one. A memory-access-intense operation means that more than 25% of the cache memory contents is updated during the operation. Each state is annotated with the resource annotations Resm and ResT, for memory and time resources. The memory resource annotation defines how much memory is (de)allocated in the state; the time resource annotation defines the time necessary to process the transition between automaton states (the time needed to compute the operations). The time annotation is proportional to the amount of data touched during the operation. It should be noted that the time annotation is not measured by the number of memory accesses or computations; it is measured by the typical operations computed during the transition. In figure 1 the typical operations are: vector copy, scalar product of vectors, matrix-vector multiplication and SAXPY [8]. For these operations the execution time is mainly defined by the CPU cache configuration, cache replacement policy and cache misses. The operation run time depends heavily on whether the processed data fits into cache memory and whether the necessary data is resident in the cache just before the computation starts. So the run time depends on the dataset size, cache memory configuration, operation type and cache state, but not on the number of operations and memory accesses.

Case 2. Annotations for concurrent programs/threads should indicate memory allocations, synchronization points, mutex sections and I/O accesses.
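Returning to Case 1: the conjugate gradient loop of figure 1 can be sketched in plain Python (a minimal dense, unpreconditioned variant; the S0-S5 comments are our reading of how the loop steps map onto the automaton states, not taken verbatim from the paper):

```python
def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def matvec(A, v):
    return [dot(row, v) for row in A]

def conjugate_gradient(A, b, eps=1e-10, k_max=100):
    """Conjugate gradient from figure 1 for a dense SPD matrix A."""
    n = len(b)
    x = [0.0] * n                      # S0: k=0, x=0, r=b, rho_0=||r||^2
    r = list(b)
    rho = dot(r, r)
    p = None
    k = 0
    while rho > eps * dot(b, b) and k < k_max:
        k += 1
        if k == 1:
            p = list(r)                # S1: first search direction is the residual
        else:
            beta = rho / rho_prev      # S1: direction update (vector SAXPY)
            p = [ri + beta * pi for ri, pi in zip(r, p)]
        w = matvec(A, p)               # S2: matrix-vector product (dominant cost)
        alpha = rho / dot(p, w)        # S3: step length
        x = [xi + alpha * pi for xi, pi in zip(x, p)]  # S4: SAXPY updates of x and r
        r = [ri - alpha * wi for ri, wi in zip(r, w)]
        rho_prev, rho = rho, dot(r, r) # S5: new residual norm
    return x

A = [[4.0, 1.0], [1.0, 3.0]]
b = [1.0, 2.0]
x = conjugate_gradient(A, b)
print([round(v, 4) for v in x])
```

For this 2x2 SPD system the method converges in two iterations to x = (1/11, 7/11); each loop body pass touches every vector once, which is why the time annotation tracks dataset size rather than instruction count.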
Thread execution times may vary due to different system load (OS interventions, I/O bandwidth and resources blocked by other programs). For specific non-shared hardware (for example video card equipment: coding/decoding units, etc.) mutex sections are used, which are annotated with the special resource type ResD. A typical example of an annotated video processing pipeline automaton is shown in figure 2. There are two automatons here: the outer one initializes the inner one. This two-level structure is required because the outer automaton needs to allocate memory and device resources first, since if the resources are insufficient all other operations are useless. The inner automaton shares memory and device resources and requires the device resource to be held for a defined amount of time. In practice this allows modeling various scenarios of device resource allocation and finding the scenarios in which the video card resources are used in the best way.

States:
S': video processing pipeline initialization
S0: catch frame from network
S1: stream depay
S2: decode
S3: format conversion
S4: post-processing
S5: rendering to target
S'': video processing pipeline de-initialization

Figure 2. An automaton for the video processing pipeline case

This case shows three types of resources: memory (Rim), which can take positive and negative quantities reflecting the actual memory budget of automaton parts; time (RiT), which defines the time necessary to complete a transition to the next state; and a blocking resource (RiD), usually a hardware resource which needs to be allocated exclusively to complete the transition to the next state. It should be noted that the timing, memory and blocking resource constraint concepts look very similar to compiler code generation technology for basic blocks [7], where concurrently compiled operations compete for resource allocation, but greedy methods usually fail to optimize resource allocation.

Case 3. Real-time firmware.
The main differences (compared to cases (1) and (2)) are the large number of uninterruptible execution sections (in the other automatons the actual execution times may be extended automatically) and precise execution timings, except for the parts where software interrupt handlers do their work. Here one of the real-life verification goals is the ability of all system threads to fit into the available processor execution time and to share processor resources without timing conflicts. Nowadays wasting a significant part of processor time is acceptable in 99% of cases and is not a big problem; on the other hand, reaction speed for embedded (medical, autonomous driving) applications is critical. Resource and timing conflicts may push a real-time system out of its strict timing, with severe damage to other system components. Another point is the software upgrade of an existing hardware system, where the computational capability of the system must be investigated. Real-time system analysis is extremely important, but this case needs an in-depth review and is out of the scope of this paper.

Case 4. Neural network inference is also a cornerstone case for computational capability analysis. The network inference pipeline consists of a sequence of computation blocks with precisely defined memory and computational complexity resources; synchronization points are easily constructed from the network structure description. The automaton for a classical image segmentation case [9] is shown in figure 3. The main goal of execution pipeline analysis is the maximal use of computational resources and specialized devices.

Figure 3. Automaton: segmentation neural network

Figure 3 shows the usual execution sequence for the neural network "forward" step. The memory and resource constraints are not trivial, as state S5 depends on data computed in state S4, S6 on S3, S7 on S2, S8 on S1, and finally S9 on S0.
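The skip dependencies just listed directly constrain how long each state's output must stay resident, which is where the non-trivial memory budget comes from. A sketch (the per-state tensor sizes are hypothetical, not from the paper) computes the peak resident memory implied by the S4-to-S5 through S0-to-S9 dependencies of an encoder-decoder segmentation network:

```python
# Hypothetical output sizes in MB for states S0..S9 of a U-shaped network.
sizes = [8, 4, 2, 1, 1, 1, 2, 4, 8, 8]

# Each output feeds the next state; the skip connections from figure 3
# additionally keep encoder outputs alive for the matching decoder state.
last_use = {i: i + 1 for i in range(9)}          # chain: Si -> Si+1
for src, dst in [(0, 9), (1, 8), (2, 7), (3, 6), (4, 5)]:
    last_use[src] = max(last_use[src], dst)       # skip: Si also feeds Sj
last_use[9] = 9                                   # final output

def resident_at(step):
    # A tensor produced at state i stays resident from step i to its last use.
    return sum(sizes[i] for i in range(step + 1) if last_use[i] >= step)

peak = max(resident_at(s) for s in range(10))
print("peak resident memory:", peak, "MB")
```

With these sizes the decoder end dominates: the long-lived encoder outputs overlap with the large decoder tensors, so the peak (24 MB here) is well above any single state's output, which is precisely the effect the resource annotations must capture.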
Timings are also unique for each state. For video cards additional dependencies are introduced, which rely heavily on hardware units, the memory hierarchy and compiler options in each particular case. After showcasing various automatons for practical tasks, another question arises: real annotated automatons need information about resource allocation (memory, processor time, specific device resources). Even if annotation is simplified somehow (there is no need to annotate all processor instructions, only the blocks of code that are meaningful in terms of resource consumption), we still need semi-precise quantities for the annotations. Let us consider how to get precise information about resource consumption.

Resource consumption modeling

Among the used resources there are resources whose amount can be obtained statistically and resources whose utilization must be known from the supervisor. For example, if a program is executed on a video card, many resource utilization schemes are possible: the program can be executed using a different number of devices, and sometimes it is not possible to obtain information about the real amount of consumed resources. Still, there are many resource types whose utilization can be calculated using statistical information about program runs. In order to analyze the software we need a rough model based on the automaton model. The rough model needs to be annotated with resource consumption in enough detail to model a software system. One reliable source of annotations is software analysis systems; let us consider the Cachegrind [10] example. Initially Valgrind and Cachegrind (referred to below as "Cachegrind") were developed as software for checking memory leaks and modeling cache memory effects for different processors. Basically, Cachegrind simulates program execution instruction by instruction, using an instrumented heap manager.
At first glance this looks useless, but even modern processors have a very limited number of tools for memory debugging: usually they can watch only 4 or 8 small memory regions, and their abilities are additionally limited by the operating system. As memory leak debugging usually runs after most middle and low severity bugs have been fixed, execution time is not an issue here. Cachegrind allows tracing operating system calls and the memory space by controlling each memory access. At the instruction execution level, Cachegrind has exhaustive statistics about the computational power used. The program which is run and analyzed under Cachegrind control is called the instrumented program. Cachegrind technology also enables precise modeling of processor cache memory, DRAM behavior and interprocessor communication. The most used feature is cache memory simulation, where practically every possible memory configuration may be evaluated for existing software. As instruction optimization does not bring as much benefit as memory optimization (except pipelining in deeply nested loops), even for modern SIMD instructions, which simply quadruple the memory bus load, precise cache simulation highlights the data traffic between the CPU core and cache memories. An example of cache memory behavior is presented in figure 4.

Figure 4. Cache memory simulation example

The figure shows the number of cache misses in the instruction cache and in the 1st and 2nd level data caches. The 2nd level data cache miss count represents the actual traffic to memory (it needs to be multiplied by the cache line size). Figure 4 shows that the function array_sum() produces 125000 1st and 2nd level cache read misses (memory read requests) and 125000 1st and 2nd level write misses (memory write requests). This corresponds to 8 MB of memory traffic, which is the real dataset size and practically annotates the code block.
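The arithmetic behind that annotation is simple; assuming the common 64-byte cache line (the line size is an assumption, not stated in the paper), the figure-4 miss counts convert to traffic as follows:

```python
def dram_traffic_bytes(llc_misses, line_size=64):
    # Each last-level cache miss moves one full cache line to or from DRAM.
    return llc_misses * line_size

# Figure 4: array_sum() shows 125000 L2 read misses and 125000 L2 write misses.
reads = dram_traffic_bytes(125_000)
writes = dram_traffic_bytes(125_000)
print(reads // 10**6, "MB read,", writes // 10**6, "MB written")
```

The 8 MB per direction matches the dataset size (one million 8-byte values), so the miss counters alone are enough to annotate the state with its memory resource.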
The memory access pattern in fact defines the time the function needs to run, and the function run time may be calculated precisely here, as memory access time dominates over the other factors defining the run time. It should be noted that the number of memory accesses depends polynomially on the problem size of the algorithm (here N), so it can be interpolated, and the correct value of memory traffic can be calculated for practically all cases of program execution. Another good sample of accounting for memory resources is neural network inference procedures. Most network operations are matrix multiplications [11] with well-known memory footprints and communication patterns (for parallel matrix multiplication versions). For shared-memory multiprocessors, matrix multiplication is parallelized using, e.g., BLAS (netlib.org/blas), also with a well-known footprint and interprocessor traffic. Therefore memory timings, memory traffic and memory footprint are well known, and as matrix multiplication is mostly memory traffic bound, the execution time is highly predictable. Synchronizations between inference computations for each level are well defined, so the resulting annotated automaton describes the process execution with high precision. Neural network inference, represented as an execution graph, is usually acyclic, which greatly simplifies automaton analysis. The only annoying issue with Cachegrind is the greatly increased execution time of the instrumented program. If Cachegrind simulation is compared to profiling based on hardware processor event counters (used by all modern profilers such as VTune), the latter has practically no overhead but gives less accurate information about the process timeline, as event counter sampling is not instruction accurate (statistical sampling methods are used). Cachegrind may introduce up to 100x overhead. On the other hand, for applications which are run on the same dataset size (e.g.
such as the forecast package [1]), the resource constraints may be calculated only once for all future execution scenarios, as resource utilization does not change, or recalculated statistically, so only one or a few profiler runs are enough to get the necessary annotation data.

Using resource annotations for resolving resource conflicts

Let us consider a simple example of two concurrent threads which periodically request some amount of a shared resource, while the resource pool capacity is less than the sum of all requests from the concurrent threads. The left part of figure 5 shows the two automatons for the two concurrent threads with annotations of their demand for the shared resource RX. For figure 5 the condition is R_X^1 + R_X^1' > ||R_X|| (the superscripts index the two threads), so simultaneous entry into states S1 and S1' causes a deadlock at memory allocation. In order to avoid the deadlock, states S1 and S1' should not be processed simultaneously: entering S1 and S1' should be synchronized, and processing of the S1 and S1' states should be mutually exclusive. The right part of figure 5 shows the source automaton after transformation, where additional states T1, T1', T2, T2' are introduced. These states act as semaphores to synchronize entry into the S1 and S1' states. This kind of synchronization can be built on top of the good old standard system libraries with semaphores/mutexes, etc. (pthreads, for example). A more fruitful point is that tools and instrumentation which introduce synchronization points may operate automatically, using annotations in the source program text, and transform the source code without the need to work at the compiler (code generation) level. Even more, these annotations can be reused when the dataset size changes, using Cachegrind-gathered information.
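The T1/T2 transformation can be reproduced with an ordinary counting semaphore; a sketch (Python's threading stands in for pthreads; the 124 MB peak follows the annotation example later in the text, while the 200 MB pool capacity is an assumed value):

```python
import threading

POOL_MB = 200          # assumed shared pool: two 124 MB peaks cannot overlap
PEAK_MB = 124

gate = threading.Semaphore(1)   # plays the role of the inserted T1/T2 states
in_use_lock = threading.Lock()
in_use = 0
max_seen = 0

def thread_body():
    global in_use, max_seen
    for _ in range(100):
        with gate:                      # T1: wait here before entering S1
            with in_use_lock:
                in_use += PEAK_MB       # S1: peak allocation of the shared pool
                max_seen = max(max_seen, in_use)
            with in_use_lock:
                in_use -= PEAK_MB       # S2: release
        # leaving the 'with gate' block is T2: the semaphore is signaled

threads = [threading.Thread(target=thread_body) for _ in range(2)]
for t in threads: t.start()
for t in threads: t.join()
print("peak concurrent demand:", max_seen, "MB (pool:", POOL_MB, "MB)")
```

With the gate in place the peak concurrent demand never exceeds one thread's allocation, so the 200 MB pool suffices; removing the gate would allow both S1 states to overlap and exceed the pool.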
The resource allocation problem is not unique to resource-constrained automatons; it was studied well in compiler theory, where the main hardware processor execution paths are considered as resource-constrained automatons and the code generator chooses, among multiple instruction execution combinations, the one which best utilizes system resources [7]. As the representation of the resource set for a resource-constrained automaton is very similar to the resource representation used in code generation for a processor, the set of compiler-related techniques for resource allocation may be considered, for example linear scan register allocation, which uses heuristics to decrease the effort necessary for analyzing how a resource-constrained automaton works.

Figure 5. Automaton transformation

The constraint resolving may use quite simple greedy methods; this works well for the automatons in figure 5 (left part). Potentially or really conflicting states are enclosed with a pair of states (T1 and T2, T1' and T2'), which work in pairs and provide a semaphore object to orchestrate resource use. The method is extensible, as this type of semaphore can be extended to any number of parallel threads. It should be noted that the automatons should be verified for absence of resource leakage, so that processing over a closed path does not increase the amount of consumed resources. Still, handling resource constraints for some states in a set looks ineffective for bigger automatons, as state insertions make the overall software model too complex. The solution is the use of a virtual machine which analyzes resource consumption between state transitions and continues work for only some of the automatons in the set. In any case the virtual machine is impossible without automaton analysis and analysis of automaton closed paths, as a greedy approach to processing automaton schemes may cause deadlocks here.
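The closed-path check mentioned above can be automated. A sketch that enumerates the simple cycles of a small automaton and sums the memory annotation along each edge (a positive net sum flags a leak; the example values follow the annotation sample in the next section):

```python
def cycle_net_resources(edges):
    """Sum the memory annotation along each simple cycle of a small automaton.

    edges: list of (src, dst, mem_delta) transitions.
    Returns the list of net deltas, one per simple cycle; any positive value
    means a closed path accumulates resources, i.e. a leak.
    """
    graph = {}
    for src, dst, mem in edges:
        graph.setdefault(src, []).append((dst, mem))
    nets = []
    def dfs(start, node, acc, visited):
        for dst, mem in graph.get(node, []):
            if dst == start:
                nets.append(acc + mem)           # closed a cycle back to start
            elif dst not in visited and dst > start:
                dfs(start, dst, acc + mem, visited | {dst})
    for start in sorted(graph):                   # each cycle counted once,
        dfs(start, start, 0, {start})             # from its smallest node
    return nets

# The S0 -> S1 -> S2 -> S0 automaton releases exactly what it allocates:
edges = [("S0", "S1", +2), ("S1", "S2", +124), ("S2", "S0", -126)]
print(cycle_net_resources(edges))   # a single cycle with zero net memory
```

Exhaustive cycle enumeration is exponential in general, which is one more reason the text proposes a runtime virtual machine for larger automatons rather than static analysis alone.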
The virtual machine concept will be analyzed in future papers. Source code annotations enable both the added states and the virtual machine. Annotations include automaton state descriptions and resource consumption information. Similarly to modern automatic documentation systems (e.g. the well-known Doxygen), annotations are placed in comments with a specific header. For example, a possible description of the automaton similar to the left one from figure 5 follows (the left column is the description in the source code, the right column explains it):

/// State S0: R[Mem]=+2MB; R[T]=+4MC      State S0: memory consumption 2MB, time – 4 million cycles
/// State S1: R[Mem]=+124MB; R[T]=+120MC  State S1: memory consumption 124MB, time – 120 million cycles
/// State S2: R[Mem]=-126MB; R[T]=+30MC   State S2: all memory marked as free, time – 30 million cycles
/// Trans: S0>S1; S1>S2; S2>S0            Transition table

This description can be automatically extracted from code comments by a utility similar to the C/C++ preprocessor or by a command-driven stream editor (similar to Linux sed). In the source code the beginning of each state is marked with a simple comment line similar to the following:

/// State S0 code start   Start of code (place synchronization primitives here)
/// State S0 info start   Start of code where Cachegrind info should be gathered
/// State S0 info end     End of code where Cachegrind info is gathered

This "code start" annotation marks the place where the synchronization primitives can be inserted by an automatic preprocessor, or where a virtual machine should reserve resources for state processing. The synchronization primitives and the virtual-machine-based thread control can be built on top of a good old standard POSIX library, e.g. pthreads. If the automaton granularity is chosen correctly, the time overhead for synchronization is negligible.
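A minimal extractor for the annotation lines above fits in a few lines of C using sscanf. The `state_annotation` record and the 7-character state-name limit are illustrative assumptions; a production tool would of course also parse the transition table and the marker comments.

```c
#include <stdio.h>
#include <string.h>

/* Parses lines of the form
   "/// State S1: R[Mem]=+124MB; R[T]=+120MC"
   (the annotation format from the text). */
typedef struct {
    char name[8];   /* state name, e.g. "S1"            */
    long mem_mb;    /* memory delta in megabytes        */
    long time_mc;   /* time delta in millions of cycles */
} state_annotation;

/* Returns 1 on a well-formed state annotation, 0 otherwise.
   The %ld conversion accepts the explicit +/- signs in the deltas. */
int parse_state_annotation(const char *line, state_annotation *out)
{
    return sscanf(line, "/// State %7[^:]: R[Mem]=%ldMB; R[T]=%ldMC",
                  out->name, &out->mem_mb, &out->time_mc) == 3;
}
```

Running such a utility over the whole source tree yields the complete state table of the annotated automaton with no manual bookkeeping.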
Precise information about resource consumption can be derived from Cachegrind output. Specific annotations point to the code places where the Cachegrind-gathered performance information should be summarized to obtain the memory traffic metric (L2 misses). Given a sample Cachegrind summary (shown at figure 4), quite a simple utility (the same sed stream editor from Unix may be used) can extract, e.g., the L2 miss counts and calculate summary information for an automaton state. The extracted information can be placed into an external definitions file (a C-style header file), organized as a record of resource consumption values, for use with the automaton resource consumption descriptions (in the virtual machine). The major benefit of the presented automaton description is that the automaton analysis can be done in a more or less automatic way, with minor human intervention. The timing and memory consumption information can be gathered semi-automatically via profiling tools; Cachegrind is only one of a wide class of profiling tools which can be used to evaluate resource consumption. The virtual machine tracks the automaton model run, and only in the case of an incorrect or overly complex model is user intervention necessary. The simplest automaton analysis can be done in greedy ways; more complex techniques such as forward resource scans may be tested. If the automaton is expressed in some standard way, for example [6], existing verification software can be used for automaton evaluation, at least for the time metric, which is useful for the verification of real-time software. The practical algorithm is not considered in this paper; it is the topic of following research on resource-constrained automaton analysis for real software automatons.

Conclusion

Various practical aspects of optimizing resource consumption for heavily loaded computer systems are considered in the paper.
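The external definitions file generated from the extracted Cachegrind data might look as follows. The record layout and identifiers are assumptions, and the numbers are the sample figures from the annotation example; such a header also makes the resource-leak check trivial, since the memory deltas over the closed path S0→S1→S2→S0 must sum to zero.

```c
/* Sketch of a generated C-style header holding per-state resource
   records extracted from annotations and Cachegrind summaries. */
#ifndef AUTOMATON_RESOURCES_H
#define AUTOMATON_RESOURCES_H

typedef struct {
    const char *state;   /* automaton state name              */
    long mem_mb;         /* memory delta, MB                  */
    long time_mc;        /* time delta, millions of cycles    */
} resource_record;

static const resource_record automaton_resources[] = {
    { "S0",    2,   4 },
    { "S1",  124, 120 },
    { "S2", -126,  30 },   /* S2 frees everything S0 and S1 took */
};

#endif /* AUTOMATON_RESOURCES_H */
```

The virtual machine can include this header directly and consult the records before admitting each state transition.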
The formalism defined in [3] can be used to describe concurrently running threads, for example as automatons with resource consumption annotations. Several practical examples of expressing software processes/threads as automatons are presented. A method for gathering resource consumption information using state-of-the-art profiling tools was presented. Also, a method for annotating the program code with automaton descriptions was provided, which enables control of automaton execution via a virtual machine. Automaton verification can be provided using existing verification tools and model environments. The next point of research is the efficient use of verification tools based on existing frameworks. It is possible to reinvent the wheel and create another verification tool, but the more useful activity is to re-use existing tools and frameworks, and this is the goal of the upcoming research.

References

1. Prusow V.A., Doroshenko A.Yu. Modeling natural and technogenic processes in atmosphere. Kyiv: Naukova dumka, 2006. 541 p.
2. Triki A., Combaz J. Model-Based Implementation of Parallel Real-Time Systems. Verimag Research Report TR-2013-11. 2013.
3. Rahozin D. A resource limited parallel program model. Problems in Programming. 2008. N 4. P. 3–10.
4. Kaynar D., Lynch N., Segala R., Vaandrager F. The Theory of Timed I/O Automata. Morgan & Claypool. 2011. 137 p.
5. Hui P., Chikkagoudar S. A Formal Model for Real-Time Parallel Computation. In Proc. of FTSCS-2012. 2012. P. 39–53.
6. Basu A., Bozga M., Sifakis J. Modeling heterogeneous real-time components in BIP. In SEFM. 2006. P. 3–12. IEEE Computer Society.
7. Muchnick S.S. Advanced Compiler Design and Implementation. Morgan Kaufmann, 1997.
8. Golub G., Van Loan C. Matrix Computations. Johns Hopkins University Press, 1996.
9. Minaee S., Boykov Y., Porikli F., Plaza A., Kehtarnavaz N., Terzopoulos D. Image Segmentation Using Deep Learning: A Survey.
arXiv:2001.05566v2
10. Weidendorfer J. Sequential Performance Analysis with Callgrind and KCachegrind. In: Resch M., Keller R., Himmler V., Krammer B., Schulz A. (eds) Tools for High Performance Computing. Springer, Berlin, Heidelberg. 2008. P. 93–113.
11. Chetlur S., Woolley C., Vandermersch P., Cohen J., Tran J. cuDNN: Efficient Primitives for Deep Learning. Nvidia, 2014.
arXiv:1410.0759v3

Received 20.02.2020

About the author:
Dmytro Rahozin,
PhD, Senior Scientific Researcher.
Publications in Ukrainian journals – more than 10.
Publications in foreign journals – 2.
http://orcid.org/0000-0002-8445-9921

Author's place of work:
Institute of Software Systems,
National Academy of Sciences of Ukraine,
03187, Kyiv-187, Academician Glushkow Avenue, 40.
Phone: (38)(044) 526-21-48.
E-mail: Dmytro.Rahozin@gmail.com