# Complex data analysis

## Introduction

### General Objectives

This section deals with techniques for extracting meaningful information from a complex system, and specifically from air traffic management, by analysing sets of existing data. Complex systems are characterized by a large number of elements, interacting in a non-linear fashion; this results in a high importance of interactions, as they can lead to the emergence of behaviours at a macroscale that cannot be predicted by means of the analysis of each single element at a microscale. The ATM system dynamics are characterized by several features that are common to complex systems; among them, the presence of a large number of heterogeneous components, with different relationships happening on multiple temporal and spatial scales, and with the presence of adaptation and self-organization. Many of those characteristics could be modelled looking at the different datasets collected through the operation of the system. Tens of thousands of daily operations encapsulate important information on the actual system behaviour, the relationship between different phenomena and the cause-effect relationships that escape from pure theoretical analysis. Additionally, extracting knowledge from data becomes especially challenging due to the fact that some data is semi-structured or unstructured, some represent the same information in different periods without an obvious characteristic scale, some types of information are challenging to analyse (e.g. images, text, etc.), some datasets are imbalanced and require special classification techniques, some data is noisy and other datasets with uncertainty require ways to aggregate larger datasets into smaller ones (Zighed et al., 2009). It is expected that the knowledge discovery process would combine different techniques to unveil some of the complex system characteristics. From the pre-processing of the data to the visualization and interpretation of the results, different integrated techniques need to be developed in order to derive the knowledge from complex datasets available in ATM. Unveiling such relationships from raw data is the scope of complex data analysis. This includes three steps:

1. Extracting the interaction and the relationship between two elements of the system, by means of analysing historical data (local analysis). In particular, this task frequently presents additional difficulty since the interactions themselves are not explicitly described by a set of data;
2. Constructing system-wide representations, including all the microscale relations in a global picture; extracting meaningful knowledge from it (global analysis);
3. Air transport evolves through different time scales. The analysis of the information needs to take into account those scales to understand if evolution is occurring due to short term fluctuations or long term trends. Separating the evolution of different elements throughout the different time scales might need specialized techniques.

Without being exhaustive in the presentation of the research lines, three research lines are presented as a way to structure potential research challenges on this topic: Local data analysis, global data analysis and evolutionary properties.

### Definitions

Complex Systems were defined in Section 0.2.1.

Complex Data Analysis: By Complex Data Analysis, we refer to any data-mining activity executed over data sets corresponding to a complex system, with the aim of extracting information about the structures of relationships between the elements of the system.

Local Analysis: With the term local analysis, we refer to the study of all those phenomena and characteristics that can be defined by analysing only some elements of the system.

Global Analysis: Global analysis refers to the study of those phenomena and properties of a system that can be understood only by analysing the system as a whole; therefore, macroscale is the typical scale in which emergent behaviours of a complex system may appear and may be studied.

Relationship Between Elements: A relationship between two (or more) elements is present when their dynamics are not independent, i.e. they share some common characteristics. Two main types of relationships can be defined:

1. Correlation between systems A and B, when a given dynamic of system A corresponds to a similar (correlation) or different (generalized correlation) dynamic of system B.
2. Causality, when the dynamics of a system (A) are forced by the dynamics of a second system (B), and therefore the former can be explained using knowledge about the latter.

While correlation is simple to detect in real data, causality cannot, in principle, be assessed. Yet correlation between two systems usually hints to the presence of a causality where a third system is affecting both of them.

### Scope

This document reviews some techniques that can be used to mine and analyse the relationships existing between the constituting elements of a complex system, by starting from the analysis of historical data sets. However, only structured alphanumeric digital data are within scope. For instance, techniques to analyse maps, audio files, video files, unstructured text (e.g. incident records) are not considered; even if they are very interesting, and can unveil relevant knowledge about the system, their analysis requires specific techniques that are out of the scope of the ComplexWorld position paper.

Techniques and specific software developments to pre-process data (e.g. data parsers) are needed to carry on research in this area, but documenting those are out of scope of this document.

Identification of the different data sources, understanding of specific potential relationships among these sources and identification of potential realities that might be analysed through them are within scope, although only some of them are documented inside the case studies section. These are intended as examples of the types of analyses that can be performed on complex systems, but the ComplexWorld position paper is not meant to be comprehensive in this context.

## Research Lines

### Local Data Analysis

#### Problem Statement

When dealing with complex systems, an important part of the information about them is codified (or embedded) within the relationships between the different elements of the system. Two levels of relationships can be defined:

1. Physical connection between the elements. These connections are created by the normal operation of the elements, and are pre-defined (or hard-wired) in the system. For instance, an aircraft operating a flight between two airports has two physical connections with each one of them.
2. Functional connections. These connections are indirectly created by the dynamics of the system, but were not explicitly built into the system. Following the previous example, when two airports are involved in a flight, their free capacities are linked - if the departure airport is saturated, and the aircraft cannot take off, the destination airport gains a free slot.

While physical connections are usually simple to identify, in the sense that information about them is usually available, functional connections are more interesting. Indeed, emergent behaviours appear in this last layer, and advanced data mining techniques are needed to define and control them. Functional relationships have some inherent characteristics that make their assessment difficult. First of all, they may be (and usually are) not linear: the dynamics of an element may have unexpected effects on a second element. Second, they are noisy, both for dynamical noise (i.e., the dynamics of an element is not the same under the same conditions) and for observational noise. Finally, extreme events may appear, for which little information is available. Finally, it should be noted that two main types of relationships may emerge:

1. Correlations, i.e., situations in which the evolution of the dynamics of two elements share some common features. For instance, the capacity of two airports, in close proximity, may evolve coherently when some adverse meteorological phenomena appears nearby;
2. Causality, i.e., when the dynamics of one element is driving the evolution of a second element. As an example, the delay at take-off at one airport may cause capacity problems in a second airport.

#### Literature Review

##### Correlations

Linear dependence The most simple relationships that can appear between two features are linear dependences, i.e., situations in which one feature can be described as a linear function of a second one. The Pearson product-moment correlation coefficient (typically denoted by r of \sigma) is the most well-known measure for linear dependence between two variables X and Y. It is defined as: \sigma = \frac\{ cov(X, Y) \}\{ S _X ^2 S _Y ^2 \} where S _X ^2 and S _Y ^2 are the standard deviations of variables X and Y. \sigma ranges from ?1 to 1, where a value of 1 implies that a linear equation describes the relationship between X and Y perfectly, with all data points lying on a line for which Y increases as X increases; a value of 0 implies that there is no linear correlation between the variables. While this metric is widely used to detect relationships between elements of a data set, it has to be noted that it only captures linear relationships, which are not the main characteristics of complex systems. In what follows, other techniques are reviewed, designed to detect more general forms of dependences.

Synchronisation likelihood The concept of synchronization likelihood (Stam and van Dijk, 2002) has been specifically developed for the analysis of chaotic systems, where linear relationships are not relevant, and more complicated forms of dependence should be detected. Given two dynamical systems, A and B, the synchronization likelihood tries to answer to the following question: when the system A follows a given dynamic, does system B always answer with the same dynamic? Note that there is an important difference with a simple correlation: the dynamics of system A and B may be completely different and uncorrelated (due to the non-linearity between them); the important point is that the same reaction in B corresponds to a given action in A. The process for calculating the synchronization likelihood is described as follows. Suppose that the data set under analysis is composed of two variables, A and B evolving with time, thus generating two time series a and b. Firstly, series a is analysed in search of relations between the values of different time windows. In other words, the existence of time windows with similar dynamics is assessed, usually by means of a linear correlation. Afterwards, the same analysis is realized for b. As a last step, results for both time series are compared, and the existence of a dependence between them is assessed.

Mutual Information and Conditional Mutual Information Mutual Information is an information metric that measures the information that two systems, X and Y, share, i.e. it measures how much knowing one of these variables reduces uncertainty about the other (Guiasu, 1977). It is calculated by assessing the entropy (usually Shannon entropy) of the conditional probability distribution of the two systems. Mathematically, this is expressed by: I(X; Y) = H(X) + H(Y) - H(X,Y) where H() represents any type of entropy.

A step forward is represented by the Conditional Mutual Information, which expresses the mutual information of two random variables conditioned on a third: I(X; Y|Z) = H(X, Z) + H(Y, Z) - H(X, Y, Z) - H(Z)

##### Causality

Classical statistical instruments, like, for instance, correlation analysis, are only able to assess the presence of some common (equivalent) dynamics between two or more systems. However, we know that correlation does not imply causality. Even if it is theoretically impossible to assess the presence of causality in a data set with full precision, several tools try to solve this problem. All of them define causality from the same basic assumptions: • Causes must precede their effects in time; • The dynamics of the causes should be included inside the effects' future, and should describe their evolution. Below, two main techniques for detecting causality are described: Granger Causality and the Permutation Entropy Causality.

Granger Causality The Granger Causality (Granger, 1969; Hoover, 2001) is an extremely powerful tool for assessing the information exchange between different elements of the system, and understanding whether the dynamics of one is led by the others. Yet, it should be noticed that its implementation is extremely complicated, and its assessment has a high computational cost. The method described below, i.e., the Permutation Entropy Causality, partly solves both of these drawbacks, and is thus preferable for the analysis of large data sets.

The method for calculating the Granger Causality is here described. Let us suppose that we are analysing two time series, X and Y, and that we are specifically trying to assess the presence of a causality going from the latter to the former: in other words, is Y is driving the dynamics of X? The first step involves the use of a standard forecasting technique to the time series X, i.e., we try to forecast one of the values of the time series by analysing all the values available before that moment in time. In the second step the same process is performed, but including also data from the other time series Y. Finally, both forecasts are compared: if the one obtained by using information about Y is better than the other, a relation between both elements does exist.

There are two features of this method that should be underlined. First, Granger Causality is not symmetric: the analysis of Y given X may return different values than the analysis of X given Y. While this property is usually observed in correlation analyses, generally speaking two systems are not expected to drive each other mutually; this situation cannot be detected through Granger Causality. Second, the other techniques described in this document try to assess the presence of relationships in the simultaneous dynamics of different elements. Conversely, Granger Causality tries to identify the influence that one element may have on another, even if there is an unknown delay in the reaction.

Permutation Entropy Causality Permutation Entropy is an information metric that quantifies the complexity of structures present in a time series, thus characterizing with precision the characteristics of the underlying system (Bandt and Pompe, 2002; Zanin et al., 2012). Specifically, it is based on the calculation of the frequency of appearance of permutation patterns, that is, permutations defined by the order of relations among values of a time series. Among the different applications developed in the last decade, some of them have focused on the problem of detecting causality relationships between the evolution of different elements of a system.

A test based on permutation patterns was proposed by Matilla-García and Ruiz Marín (Matilla-García and Marín, 2008). Instead of creating normal permutation patterns for a 1-D time series, they consider a two-dimensional time series, e.g. W = (X, Y). To each subset of W, a bi-dimensional permutation pattern is assigned, and the frequency of appearance of all 2-D patterns is assessed; if the probability distribution does not asymptotically follow a $\chi^2$ distribution, both time series X and Y are not independent. This, combined with an analysis based on Conditional Mutual Information, allows an assessment of the existence of a causal relation between X and Y (Bahraminasab et al., 2008).

##### Symbolic Data Analysis

In the classical approach to data mining, a system is usually composed of a set of features organized in a matrix, where each individual takes one single value for each one of the different variables describing the system. When calculating correlations and causalities as before, it is assumed that the system fulfils this characteristic. Yet, in many practical situations, this representation is not enough to represent the complexity of the system under study, its variability, and the degree of uncertainty associated with each measurement; or the system is simply too big, and its features cannot be directly stored in the memory of a computer, nor directly analyzed. In these cases, it is useful to manipulate more complex features, like, for instance, intervals, distributions, or even several values linked by a taxonomy and logical rules: collections of such complex features are called symbolic data. The family of techniques extending the traditional data analysis to handle these structured data is called Symbolic Data Analysis (Brito, 2007; Diday, 2002). Symbolic Data Analysis enables more sophisticated studies with respect to traditional Data Analysis: for instance, it is possible to study the relationship between the probability distributions of delays of several airports and classify them in groups, without the need of analyzing each single flight. Relationships may include the detection of clusters (Gowda et al., 1991; El-Sonbaty et al., 1998), regressions (Neto et al., 2008), or Principal Component Analysis (Lauro et al., 2000), among other.

#### Research Challenges

Assessment of correlations and causalities. While the problem of detecting correlations and causalities has been extensively studied in recent decades, this has usually been done for pairs of time series, each one composed by a set of values. Yet, in the context of the air transport system, it is usual to work with multivariate time series. The most interesting example is trajectories, i.e., sets of two (or three) elements evolving with time; furthermore, it should be noticed that these components do not evolve independently, but their dynamics are strictly linked by the physics of the flight. Developing techniques to assess the level of correlation (or causality) between different trajectories is therefore a relevant research challenge, which may have important applications, for instance, in safety analysis - see Section 3.3.1.

### Global Data Analysis

#### Problem Statement

As described in Section 3.2.1, there are several interesting techniques to extract the relationships between elements of a complex system, which are hidden in historical data. Yet, this should be seen as the first step only. Emergent behaviours in complex systems usually arise from complete patterns of relations between elements. In other words, it is not enough to assess the presence of a single relationship, but the whole structure should be studied. Furthermore, there is a need to synthesize the properties of this structure in simple numbers (i.e., metrics, or features) that can be easily understood. Up to now we have considered that all elements composing the system are homogeneous, or represent similar objects. In an aeronautical context, they can be aircraft, airports, passengers, and so forth. Most of the real systems, however, are composed of heterogeneous elements interacting between them. While the former situation can be seen as a set of elements interacting on the top of a single layer, the latter requires an extension to multiple layers, each one of them interacting with the others. An example of this will be shown in Section 3.3.1.

#### Literature Review

Various scales (micro, meso and macro) were introduced in Section 0.2.2. We now explore some corresponding metrics.

##### Microscale Metrics

Number of nodes and links (N and M) - They are defined by the cardinalities of the set of nodes and links composing the network. Link density (d) - The link density is defined as the proportion of links present in the network, with respect to all possible links. This is given by d= M/N(N-1).

Entropy of the degree distribution (Hdd) - As seen in Section 3.2.2.1, it is usually useful to assess whether links are uniformly distributed across nodes or, on the contrary, few nodes act as hubs of the system, centralizing most of the connections. Information about the presence of highly connected nodes can be obtained through the degree distribution associated to the network, that is, a function P(k) expressing the fraction of nodes with a degree of k. Given a probability distribution, the entropy measures the expected quantity of information codified in it, or the uncertainty that we have about one missing value of the variable under study (Shannon, 1948). In this context, this measure quantifies how links are distributed across nodes, and has been related to the robustness of networks, i.e. their resilience to attacks (Wang et al., 2006).

Maximum degree (Kmax) - An important feature that can be extracted from the degree distribution P(k) is the maximum degree of nodes, i.e., the degree of the most connected node inside the network.

##### Mesoscale Metrics

Clustering coefficient (C) - The clustering coefficient, also known as transitivity, measures the presence of triangles in the network (Newman, 2001). A high clustering coefficient has been historically associated with social networks, where it means that "the friends of my friends are also my friends". Mathematically, it is defined as the relationship between the number of triangles in the network, i.e., a set of three nodes with one link between each pair. Motifs - In real world networks, it has been found that some sub-graphs, i.e., specific structures created by a small number of the original nodes, have a higher than expected probability of appearance. When such probability is statistically relevant, these sub-graphs are called motifs (Milo et al., 2002; Milo et al., 2004). In other words, motifs can be seen as Lego blocks, with which the complex network is constructed. The presence (or absence) of some motifs is tightly related with the function or process performed by the network. For instance, triangular motifs (i.e., representing a loop in the flow of information) are used to create short time memories in neural and protein interaction networks (Milo et al., 2002).

##### Macroscale Metrics

Average geodesic distance (l) - Considering a network as a transportation system, the number of links included in the path connecting two nodes is called the length of such a path. Of special interest is the shortest path, also called the geodesic, between two nodes, and defined as one of the paths connecting these two nodes with minimum length (notice that more than one geodesic path may exist). The average geodesic distance is defined as the mean of all possible geodesic distances. Efficiency (E) - A problem with the definition of the average geodesic distance is that it diverges when the network is not connected, i.e., there are nodes which are not reachable. Although this can be solved by only taking into account the group of strongly connected nodes, this also introduces distortions in the metric, and situations with two independent and separated communities of nodes are not correctly handled. To overcome this situation, the measure of efficiency was proposed, defined as the mean of the inverse of the distances (Latora et al., 2001). The efficiency assesses the performance of the network in sending information. Small-worldness (S) – this was described in Section 0.2.1. More formally, it is defined as (with both C and l normalized with respect to random networks) (Humphries et al., 2008). When S is greater than one, the network is considered to be ‘small-world’.

#### Research Challenges

Adaptation of metrics by means of a macroscale data analysis. As we have seen in Section 3.2.2.2, plenty of metrics have been developed in recent years to characterize the properties of complex networks, from a macro-scale, meso-scale and micro-scale point of view. Yet, most of this work has been performed through a top-to-bottom approach: that is, most metrics have been developed from a theoretical perspective, without considering the specificity of real-world applications. For instance, one can assess the connectivity of a network by means of its efficiency (see Section 3.2.2.2 for its definition). While this measure is meaningful for a communication system, its applicability to the air transport system is reduced. The number of links (i.e., flights) needed to go from one airport to another one is not enough to characterize the mobility of passengers, as many other elements (duration of flights, availability of connections, delays, different airlines and alliances) should be taken into account. Therefore, it is necessary to adapt the existing metrics to each problem studied by means of a macroscale data analysis, in order to improve the meaningfulness of obtained results.

### Evolutionary Properties

#### Problem Statement

The characterization of the topology and of the dynamics developing on top of a network, at both macro and micro-scales, is critical for the understanding of any complex system, with examples spanning from biological to social systems (Boccaletti et al., 2006). In the specific case of air transport, different network structures, ultimately given by the connections built by the airlines and other airspaces users, lead to a number of important properties, both inherent to the network and with implications in the way mobility as a service is provided to the final users. However, those macro and micro structures are far from being static. Different factors caused the network to evolve, varying significantly both the inherent properties as well as the mobility provided in Europe, e.g. seasonal variability in the short term, up to legislation changes in time windows of decades. In other words, the air transport system is not stationary, in that it is a system in a persistent out-of-equilibrium condition, whose characteristics are continuously varying with time. Due to this, the primary analysis of the topology and dynamics of the system should be complemented with information about the evolution of the same. This would provide additional information about the rate and type of the new connections of the air transport network, and therefore it would represent a better technique to produce a forecast. There are several aspects that should be taken into account: • The kind of relational data to be analysed, i.e., it is necessary to identify the type of network representation most relevant for the forecasting problem under study; • The identification of the scales (both temporal and spatial) characteristic of the evolutionary process; • Derived from the previous point, identification of the time windows in which the system can be approximated as stationary; • The choice of the best metrics to represent such evolution. Additionally, when forecasting the evolution of the network, different scenarios should be considered, requiring different data layers to be fed into the models towards determining the growth of the network. A given topological element can favour the evolution of the network towards certain structures: understanding how to extract information through complex data mining techniques supporting network analysis is the subject of this research line. It must be noted that such analysis is not restricted to the forecasting of traffic in the network. Beyond traffic growth, the analysis of the structure of the network could also provide insights into other metrics that are very relevant to the air transportation network, especially those that are non-trivially dependent on the dynamics of the system. For instance, a definition of mobility is associated with the concept of connectivity, but a coherent definition based on data is needed in this context; or forecasting fuel consumption, which is linked not only to traffic growth, but also to the number of connections that passengers need to make to arrive at their destination. Regarding this last example, the capability to describe the evolution of different network structures based on data could provide insights on the potentials for the industry to reduce fuel consumption while maintaining mobility, or, in general, the sensitivity of mobility to fuel consumption through an analysis of the connections evolution.

#### Literature Review

One of the most important problems tackled in the literature associated with complex networks is their evolution. Specifically, an important effort has been devoted to the understanding of the mechanisms and forces driving such evolution, both from a theoretical (see, for instance, Barabási et al., 1999) and from an applied point of view (Gautreau et al., 2009). The underlying idea is that different "natural" networks evolve, maximizing their chances of survival and adapting their configuration to the environment. Human-engineered networks have not necessarily grown through evolutionary adaptation, however, research indicates that the design principles could be similar (Shin et al., 2009). The evolution of the different properties for the air transport network, in particular link density (design cost) and efficiency (network congestion) have not been deeply researched on long temporal scales. While there are different algorithms to grow network structures to prevent congestion with minimum design costs, there are no evolutionary optimization models that provide guidance on the best structures for the transport network in its evolution. The air transport network presents a structure in which the centrality of hubs is not stable. Aircraft technology, state regulations and evolution of market demand are found to be the main variables that affect the centrality of hubs (Bowen, 2002). While globalisation and liberalisation fuelled the expansion of the network and its densification, it is not clear at this point what the regime is of the evolution of the network structure. Over decades, new business models, infrastructures, shifts in market demand, aircraft technology, international regulations (Bowen, 2002) and externalities, like fuel prices or airport subsidies, have caused the network to evolve, varying significantly both the inherent properties as well as the mobility provided in Europe. Additionally, historical "accidents" arising from geographical, political and economic factors have played an important role in the network evolution (Guimera et al., 2008). Connectivity and interconnectivity metrics are also evolving through time. Some metrics provide insights on the connectivity, like the Shimbel index, but important drawbacks lead to the need for new "effective affordability" indexes (Bowen, 2002).

#### Research Challenges

Optimizing the discretization of evolving data. Although most of the researchers use OAG as the main source of data, it is not clear what is the right time step to be chosen to compute the different indicators and its evolution. For instance, several authors pick few years (e.g. less than five) and group the data in two seasons to derive the appropriate metrics of the network. It is not clear if evolutions are to be considered within those time scales or analysing the evolution of the network requires longer time series. Several factors are considered in the literature as important influences in the network evolution, like new business models, infrastructures, shifts in market demand, aircraft technology, international regulations and externalities like fuel price, but there are not coherent datasets compiled to be considered as potential inputs for network evolution research. The influence of these factors in network structure evolution has been researched through correlations, but there is no coherent framework for this either.

Connectivity and interconnectivity. Modern network theory provides an interesting foundation for the study of transportation systems in general. However, it should be taken into account that this theory was initially developed from an abstract perspective, i.e., without the aim of modelling real-world problems. Therefore, most indicators, metrics or indexes currently available in the literature ignore fundamental aspects, such as, for instance, the connections schedule. Clearly, such elements are an essential part of transport networks, and therefore additional research needs to be done in order to evolve network theory indicators to incorporate such aspects. Additionally, while connectivity has been extensively studied (even if only from a theoretical point of view), interconnectivity, understood as paths connecting two nodes through a third, has not received enough attention, especially taking into account schedules. Both connectivity and interconnectivity evolution could provide powerful insights in to foresighting techniques.

## Case Studies

### Operational Safety from a Complex Data Analysis Perspective

Introduction Safety is one of the most important requirements for ATM. In spite of the importance of safety for ATM, there are two main challenges that should be tackled in the future. Firstly, there is an important lack of safety metrics. While most reports usually count the number of unsafe events, and how they have evolved from the last analysis, it is clear that this is not enough to really understand the safety characteristics of the system. This is because incidents and accidents are extremely rare events, and a good metric should take into consideration safe situations that may evolve into unsafe conditions. Second, plenty of information about the safety of the system should be encoded in historical operational data. For instance, patterns that may develop into unsafe situations; records of decisions taken by pilots and ATCOs; interactions between procedures and unexpected meteorological conditions; and so forth. In both cases, most of the relevant information is encoded in the relationships between the elements of the system, as is expected in a complex system. Therefore, complex data analysis appears as the natural solution to mine such relationships, and understand the operational consequences of them.

Example of a Safety Analysis In this example, we will just focus on the problem of the interactions between the trajectories of different aircraft; the aim is to recognize if such interactions may result in subsequent unsafe situations, and thus lay the basis of an intelligent prognosis tool for air traffic control. The idea of post-analysing historical ATM data, searching for patterns related with safety, has been already studied in several works, although results have not gone beyond some general considerations. Such analyses have been mainly based on standard statistical techniques, as independence assessment between air sectors and incident densities (Nazeri, 2006), and correlations between airspace global metrics and the occurrence of accidents and incidents (Nazeri et al., 2008). A step further was proposed by Tedeschi (2009), in which unsafe events are studied by means of complex networks. Here, we propose a different approach, as an example of complex data analysis applied to safety. Initial information will be composed of aircraft trajectories, i.e., of historical radar tracks and their corresponding flight plans. Within a given sector, the elements of the system that we may consider are aircraft, and relationships between them are given by trajectory changes. Specifically, we consider that an interaction has occurred when the trajectory of one aircraft has changed due to the presence of another aircraft (therefore, we are considering ATCO instructions to maintain separation between them). Such relationships can be seen as a generalized form of the Granger Causality (see Section 3.2.1.2); the change in the trajectory cannot be explained without considering the presence of a second aircraft, as this change was not included in the flight plan, and therefore indicates the presence of a causality between both of them. When this kind of relationship is assessed for several aircraft, for each pair of them, a network representation can be constructed. As already seen in Section 3.2.2, the global topology of this network can unveil new knowledge about the underlying system; in this case, chains of sequential ATCO interventions can be analysed, in order to detect the real source of the problem, and to design new procedures minimizing the intervention in the dynamics of the system.

### Driving Air Transport Network Topology and Dynamics

Introduction New approaches are clearly needed to better understand the long-term evolution of the Air Transport System. Mainly as a collateral consequence of its extraordinary success over the past 50 years, the system has grown so complex that, in a world which is also changing very fast, the wording which comes naturally to mind when stakeholders think about the future of air transport is: uncertainty, growing challenges and expectations, unforeseen behaviours and the like. Air transport has basically been an engineered system driven by carefully planned events (like the development of new aircrafts and the design and building of new infrastructures) which have induced and ultimately met an ever-growing social demand of mobility at affordable prices. Even its liberalization has been more a top-down issue, from decisions taken at the political level, than the result of uncoordinated actions by different types of stakeholders. In other sectors, the balance between profits and risks, investment decisions and uncertainty, self-sustaining innovations and market flops can be thought of as properly driven by (more or less free) market forces (notwithstanding the fact that, in almost any economic domain, the assessment on respective contributions by public institutions and free-market forces is still very much a subject of debate and controversy). Systemic complexity has grown dramatically for a number of reasons: the incredible expansion in past decades (the number of air travellers per year is roughly equal to 50% of the total world population), the number of actors involved, the highly sophisticated underlying technologies and the growing impact on many different social issues (not the least, on environmental concerns). On the other hand, relevant decisions (e.g. development of new aircraft technologies, planning of new airports or major changes in regulation) are still to be: a) grounded on active cooperation between public and private players, and b) taken with a long-term scale in mind, in the range of decades, because they mobilize huge investments over a long period of time. In such a dynamic environment, foreseeing future changes, especially for long-term investment decisions and regulation/policy making purposes, becomes increasingly complex, as it is unlikely that classical methods for looking into the future will still be the most appropriate tools for dealing with their complexity. Classical models of market demand analysis - or even market demand analysis constrained by boundary conditions that extrapolate system behaviour from the past to make informed decisions in the future- are likely to become insufficient when the underlying trends are moving. Work is needed in to innovative foresight methodologies to build some scientific foundations in how to make informed policy, planning and investment decisions.

Problem Statement Many elements can be considered as influences in air transport network evolution. The actual influence of those elements in the network structure and its evolution should be researched through the appropriate data mining and network analysis techniques, defining the leading and lagging indicators. Both leading and lagging indicators shall provide insights that could complement other forecasting techniques, like the time series analysis of market demand in different markets over different time scales. In order to build a foresighting framework, capable of providing insights on the best policies to be promoted in terms of network evolution, infrastructure growth, market demand stimulation, etc., a coherent framework to drive topology and dynamics for the air transport network needs to be built. This evolutionary framework shall be capable of informing policy makers, airlines and infrastructure operators on the best network evolution strategies to improve air transport mobility in Europe.

## Recent Developments

This section is devoted to describing recent research results that are relevant to the research theme of complex data analysis. If you have any related results that you wish to contribute, please feel free to add your contribution as a subsection below. Also consider linking your results with relevant portions of the main text of the article and/or other articles (e.g. related research lines) to increase its visibility and help give it a context inside the research theme.

### Multi-scale analysis of the European airspace using network community detection

#### Introduction

The European airspace can be viewed as a multi-scale network whose nodes can be airports, sectors, or navigation points; the traffic itself gives birth to naturally weighted edges between the nodes. Indeed, the European airspace is in fact a complex, hierarchical system which is partitioned for a series of reasons, mainly related to air traffic control.

At the highest level, the space is partitioned into multinational areas, termed Functional AirBlocks (FAB). The FABs are not yet fully implemented, but are planned in the near future as a mean to increase the capacity in terms of traffic. Then each country has its own National Airspace (NA), which is typically partitioned into several Air Control Centers (ACC). Each of these is itself partitioned into sectors, which are the smallest unit of control, being under the direct supervision of air traffic controllers. Finally, inside the sectors we find the navigation points, constituting the grid on which the flights move. In fact, nowadays, flight plans are defined as a set of consecutive fixed points that the aircraft is supposed to pass at predefined times. On the smallest scale, therefore, a flight plan is a path on a grid whose nodes are the navigation points. The choice of the boundaries of these multiple partitions is decided in a strongly supervised and not fully quantitative way, taking into account political and strategical reasons and also traffic considerations.

On the other hand, an important characteristic of a complex network is its organization in communities (clusters) (Fortunato 2010). Communities are generically defined as sets of nodes that are more connected among themselves than with the rest of the network. They are, therefore, an important element to understand and model the architecture of a network. In this paper we show how the community detection in networks provides information on the appropriateness of the airspace design at the various scales considered, based on the sole knowledge of the actual air traffic data. We show how methods devised for identifying communities in networks could be used to help designing the structure of airspace in a bottom-up way, i.e. starting from the observed behavior of the system. For this, we use several algorithms for unsupervised communities detection, and this analysis could be seen also as a “horse race” among these methods in order to find which works best when the investigated network describes a traffic flow (not necessarily aircraft, but also cars, data, etc.). These two aspects can not be kept fully distinct. Indeed, it is hard to distinguish whether the actual traffic flow is a consequence of the airspace partition or viceversa. As we mentioned above the airspace partition is due to different reasons sometimes unrelated with the effective air traffic needs.

#### Methodology

Based on traffic data provided by Eurocontrol (flight plans of all flights in European Airspace for 28 days), we built three networks:

• Airports
• Sectors
• Navpoints

We used three different methods for community detection, which are described more in detail in communities detection methods in ATM: modularity maximization (with the Louvain Algorithm), infomap and OSLOM.

Finally, in order to compare the partitions obtained with the different methods and the existing partitions, we used mainly two metrics, which are the rand index and the mutual information. Properly adjusted (against a null model) and normalized, they give a value between 0 (partitions totally distinct) and 1 (same partitions).

#### Results

##### Navpoint network

The navpoint network, which has been studied considered in (Cai 2012) to our knowledge, shows an exponential behavior for the distributions of degree, strength and betweenness centrality. The network is nearly planar and very sparse: it is a geographically embedded network with no big "shortcuts" between nodes.

The following figure shows an example of the result of an unsupervised community detection, here with OSLOM. As one can see, the communities are "geographical" in the sense that there are no geographical overlap between them. This is a consequence of the structure of the network, geographically embedded.

Figure showing the result of an unsupervised community detection on navpoint network with OSLOM algorithm

The different methods of communities detection reveal different levels of description of the system. The table shows for instance the number of communities detected by the algorithms, as well as the ones present in existing partitions: zones of control (ACC) and national airspaces (NA).

Infomap Modularity OSLOM ACC NA
598.0 42.8 150.0 75.5 46.0

With some proper comparisons, one finds that the best algorithms are OSLOM for the ACC, with a mutual information of 0.59 in average, and the maximization of modularity for the NA, with 0.57 as averaged mutual information. The infomap partition could in fact be related to the division in sectors, but it cannot be checked straightforwardly with this network of navpoints, because navpoints are in fact crossing several sectors in altitude, thus possibly belonging to several sectors at the same time.

Since the mutual information is high but not equal to 1, it gives space to some improvements. Policy makers could base their decision of this new partitions when redesigning the airspace.

##### Sector network

The sector network is quite similar to the previous one, because it is also geographically embedded. Indeed, sectors can not span a huge geographical area, cannot have a high number of neighbors, and thus no hubs are present. Indeed, we find that the distributions are roughly exponential, except for some outliers corresponding to peculiar sectors (in Germany).

The communities detection are now giving a number of communities displayed in the following table.

Infomap Modularity OSLOM ACC NA FAB
66.7 12.6 49.9 115.4 49.5 12.0

Once again, the different methods correspond to different levels of description. While the modularity maximization seems to be related to the FAB, OSLOM is closer to NA and Infomap to ACC. When comparing the partitions more closely with the mutual information, we find that indeed modularity maximization is the best description for FAB (0.54), whereas OSLOM is the best for NA (0.53), although not by far compared to the others, and that Infomap and OSLOM seems to be as close to the ACC partitions.

Once again, the fact that the Mutual Information scores do not reach 1 indicates that there is room for improvements of in the design of the airspaces, at far as the traffic is concerned.

##### Airport network

The airport network have been studied extensively. It is a small world, with power law distributions for degree and strength for instance, as we have checked with the data.

The communities themselves (figure 6 of the article) display an unexpected geographical behavior. Even if this kind of network has hubs, is subject to rich-club effect and so on, the communities detected are surprisingly non-overlapping. Moreover, some communities are clearly very consistent with national borders (or geographical entities, like United Kingdom and Ireland).

The different algorithms give once again different partitions, see the table below.

Infomap Modularity OSLOM NA FAB
24.5 9.4 16.3 42.0 12.0

A precise analysis with the mutual information shows that the best description of the FAB is given by the maximization of modularity (0.53 for mutual information) and that OSLOM gives a good description of the NA (0.45).

The conclusion for the airport network cannot be the same than for the other ones. First, because the airports are not operated entities in the sense of the ATM, on the contrary of FAB for instance even though, as a first approximation, one can consider that the traffic between airports give the main flow for the airspace. Second point, the airports can have long range interactions, and thus be a priori in the same community, while in two different countries which obviously cannot be in the same FAB. Hence, it is important to highlight these long range interactions and see how they are interfering with the rather “compact” communities we found in the network of airports. Hence, we have completed this analysis with a slightly different method.

##### Airport network -- Extracting the distance bias

In transportation systems like this one, nodes tend to be more connected to their geographical neighbors, just because they are close to each other. But of course there are other non local causes to the formation of communities. For instance, Ryanair has dedicated airports all over Europe, which are more likely to be connected. Hence, Beauvais in France (close to Paris) or Ciampino in Italy (close to Rome) are more likely to be connected to each other than to Fiumicino or Charles de Gaulle, respectively. Nationalities and common languages can also be valid causes.

A way of capturing this phenomenon and avoiding the distance bias is based on the method described in Expert 2011. The idea is to maximize modularity by using a null hypothesis that takes into account the geographical distance. Note that the Louvain algorithm cannot be used anymore with a mull model taking explicitly the coordinates of nodes into account. In this paper we used a simulated annealing algorithm to find the best modularity.

The figure 7 of the paper shows the result of the method. The picture is now very different. The communities are much less geographical constrained and long range interactions between airports are enhanced. For instance, the European airports operated mainly by Ryanair are now in the same community (displayed in salmon on the map). These interactions between these airports cannot be explained by geographical distance between them but only by the fact that they are operated by the same company. This community is not in the previous figure 6 and has been detected only by removing the distance bias.

In the same line, we can notice the specialization of airports surrounding big cities. Paris Orly for instance is in the same community than many Spanish airports, which is consistent with its actual role. Paris Charles De Gaulle is more linked to the central and eastern Europe and Beauvais, as noticed before, is operated only by Ryanair and is in its network. Around London each airport is also in a different community. Luton is more linked to other English airports, whereas Heathrow is linked to Italy and Portugal, Gatwick and the London City Airport to airports located in France and Switzerland mainly, and Stansted is in the Ryanair network.

Let us emphasize also that geographical clusters did not disappear completely. Spain, United Kingdom, Italy, Greece and Scandinavian countries have their own community, despite the spatial null model. This might be due to national characteristics (language, common firms, etc.) which are not directly concerned by distance.

#### Conclusions

At each scale the community detection algorithms provide useful insights on the system. In the case of the navpoint network we have obtained that the modularity partition gives big communities with sizes comparable to the NAs, while the OSLOM one gives a finer clusterization, close to the ACCs. The partitioning obtained by using the In- fomap algorithm is close to reproduce the air traffic sectors. In the case of sector network the Infomap and OSLOM algorithms provide partitions which are quite close to the ACCs. For the airport network, the considered algorithms provides a partition of the airspace close to the one of FABs. When the community detection algorithm is able to take into account the geographical constraints, then some long distance interactions between airports emerge. These are not explained by geographical proximity, and cannot be strictly linked to general operational issues. Rather, they might reveal strategies operated by specific airlines, such as, for example, Ryanair. In fact, Ryanair has not a business model based on large hubs. Rather, flight plans are scheduled without explicitly considering the presence of connected flights.

Our study is an example of how the community detection in networks provides an effective feedback on the appropri- ateness of the airspace design at the various scales considered, based on the sole knowledge of the actual air traffic data. It is important to emphasize that Europe (as well as the US) is moving toward a new scenario of air traffic management according to an ambitious and long term project termed SESAR aiming at changing the architecture of the European airspace based on a new set of paradigms (SESAR 2007). In this new scenario, cross national control units will be defined and they will be based more significantly on traffic demand than on national constraints. In this framework our bottom-up approach for the partitioning of the European airspace with community detection algorithms could be used to improve the design of this important transport infrastructure.

Finally, we also believe that this approach could be fruitfully adopted in other types of traffic networks (De Montis 2010).

## References

• Bandt, C., Pompe, B., 2002. Permutation entropy: a natural complexity measure for time series. Phys. Rev. Lett. 88, 174102.
• Bahraminasab, A., Ghasemi, F., Stefanovska, A., McClintock, P. V. E., Kantz, H., 2008. Direction of Coupling from Phases of Interacting Oscillators: A Permutation Information Approach. Phys. Rev. Lett. 100, 084101.
• Barabási, A.-L. and Albert, R., 1999. "Emergence of Scaling in Random Networks", Science 286 (5439), pp. 509-512.
• Barrat, A., Barthélemy, M., Vespignani, A., 2005. The effects of spatial constraints on the evolution of weighted complex networks. J. Stat. Mech. P05003.
• Boccaletti, S., Latora, V., Moreno, Y., Chavez, M., Hwang, D.-U., 2006. Complexnetworks: Structure and dynamics. Physics Reports 424 (4–5), pp. 175–308.
• Bowen, J., 2002. Network change, deregulation and access in the global airline industry. Economic Geography 78 (4), pp. 425-439.
• Cai KQ, Zhang J, Du WB, Cao XB (2012) Analysis of the Chinese air route network as a complex network. Chin. Phys.

B 21: 028903.

• Costa, L. da F., Rodrigues, F. A., Travieso, G., Villas Boas, P. R., 2007. Characterization of complex networks: A survey of measurements. Advances in Physics 56 (1), pp. 167-242.
• De Montis A, Caschili S, Chessa A, 2010. Time evolution of complex networks: commuting systems in insular Italy. J

Geogr Syst 13: 49-65.

• Fortunato S, 2010. Community detection in graphs. Physics Reports 486: 75-174.
• Granger, C.W.J., 1969. Investigating causal relations by econometric models and cross-spectral methods. Econometrica 37 (3), 424--438.
• Guiasu, S., 1977. Information Theory with Applications. McGraw-Hill, New York.
• Guimerà, R., Mossa, S., Turtschi, A., Amaral, L. A. N., 2005. The worldwide air transportation network: Anomalous centrality, community structure, and cities’ global roles. PNAS 102 (22), pp. 7794-7799.
• Hoover, K.D., 2001. Causality in macroeconomics.Cambridge University Press.
• Humphries, M. D., Gurney, K., 2006. Network "Small-World-Ness": A Quantitative Method for Determining Canonical Network Equivalence. PLoS ONE 3 (4), e0002051.
• Latora, V., Marchiori, M., 2001. Efficient behaviour of small-world networks. Physics Review Letters 87, 198701.
• Matilla-García, M., Ruiz Marín, M., 2008. A non-parametric independence test using permutation entropy. Journal of Econometrics 144, 139-155.
• Milo, R., Shen-Orr, S., Itzkovitz, S., Kashtan, N., Chklovskii, D., Alon, U., 2002. Network motifs: simple building blocks of complex networks. Science 298, pp. 824-827.
• Milo, R., Itzkovitz, S., Kashtan, N., Levitt, R., Shen-Orr, S., Ayzenshtat, I., Sheffer, M., Alon, U., 2004. Superfamilies of evolved and designed networks. Science 303 (5663), pp. 1538-1542.
• Nazeri, Z., 2006. Data Mining of Air Traffic Control Operational Errors. 2006 International Conference on Data Mining, Las Vegas, Nevada, USA.
• Nazeri, Z., Barbara, D., De Jong, K., Donohue, G., Sherry, L., 2008. Contrast-Set Mining of Aircraft Accidents and Incidents. Lecture Notes in Computer Science, 5077, pp. 313--322.
• Newman, M. E. J., 2001. Scientific collaboration networks: I. Network construction and fundamental results. Physical Review E 64, 016131.
• Newman, M. E. J., 2003. The Structure and Function of Complex Networks. SIAM Review 45 (2), pp. 167-256.
• Ras, Z. W., Tsumoto, S., Zighed, D. A., 2007. Mining Complex Data. Springer.
• SESAR, "Definition of the future atm target concept - d3" 2007.
• Shannon, C. E., 1948. A mathematical theory of communication. Bell System Technical Journal 27, pp. 379-423.
• Shin, S.-Y., Namatame, A., 2009. Evolutionary Optimized Networks and their properties. International Journal of Computer Science and Network Security 9 (2).
• Stam, C. J., van Dijk B. W., 2002. Synchronization likelihood: an unbiased measure of generalized synchronization in multivariate data sets. Physica D 163 (3), 236--251.
• Strogatz, S. H., 2001. Exploring complex networks. Nature 410, pp. 268-276.
• Tedeschi, A., 2009. Coupling and Complexity of Interaction of STCA Networks. 8th Innovative Research Workshop & Exhibition, EUROCONTROL Experimental Centre Brétigny-sur-Orge, France.
• Wang, B., Tang, H., Guo, C., Xiu, Z., 2006. Entropy optimization of scale-free networks robustness to random failures. Physica A 2 (363), pp. 591-596.
• Zanin, M., Zunino, L., Rosso, O. A., Papo, D., 2012. Permutation Entropy and Its Main Biomedical and Econophysics Applications: A Review. Entropy 14 (8), pp. 1553-1577.
• Zighed, D. A., Tsumoto, S., Ras, Z. W., Hacid, H. 2009. Mining Complex Data. Springer-Verlag, Berlin.