Methodology
Contents

1 Introduction
2 Structure of the data and notation for the regions
3 Regional manufacturing indicators
3.1 Specialization and concentration: basic concepts
3.1.1 General approach
3.1.2 An additional note on international comparisons
3.1.3 Local approach
3.2 Further indicators
4 Regional manufacturing structure: simultaneous grouping of regions and activities
4.1 Background
4.2 Singular value decomposition
4.3 Best collapsed table: the algorithm
5 Identification of specialized agglomerations
5.1 Notation for clusters of regions
5.2 A structural model
5.3 The concept of specialized cluster in a structural model
5.4 Detecting a specialized agglomerations scheme
5.5 Testing the role of contiguity for specialized agglomerations
5.6 The main parameters of the algorithm
5.7 Manufacturing competitiveness
6 A retrospective view
References
Methodological framework for the Analysis of Industrial Geographical Data.
Appendix for "La Geografía Industrial de América Latina y el Caribe" (v150924)

Christian HAEDO (a) and Michel MOUCHART (b)

(a) Centro Interuniversitario de Investigaciones sobre Desarrollo Económico, Territorio e Instituciones (CIDETI), Alma Mater Studiorum-Università di Bologna in the Argentine Republic
(b) Institut de Statistique, Biostatistique et Sciences Actuarielles (ISBA) and Center for Operations Research and Econometrics (CORE), UCLouvain, Belgium

MIALC project commissioned by EULAC Foundation, Hamburg, Germany.
This document has greatly benefited from comments of Dominique Peeters, Center for Operations Research and Econometrics (CORE), UCLouvain, Belgium.
1 Introduction

This report provides some hints that should help the reader understand the methodological issues underlying the treatment of areal data for the elaboration of regional economic policy.

Economic activity is spatially concentrated. Spatial concentration generates agglomeration economies, notably upstream-downstream linkages, which help firms become more productive. These positive effects involve a critical mass of workers and infrastructure, and dense networks of suppliers and collaborators. The central role of agglomeration economies in the spatial structure of the economy has inspired a large literature that tries to understand their causes or origins and their dynamics, as well as the most adequate industrial policies for consolidating them and for promoting them wherever industrial agglomerations are weak or nonexistent. It has been recognized that industrial policy objectives can be better fulfilled if they are more sensitive to places and sectors in design and delivery (Donato, 2002; Nathan and Overman, 2013).

Agglomeration economies may appear at different geographical scales and may involve different disaggregation levels; consequently, one spatial scale is not necessarily equivalent to another (Arbia, 1989, 2001; Krugman, 1991; Arbia and Espa, 1996; Anas et al., 1998; Duranton and Overman, 2005; Arbia et al., 2008, among others). In this report, we analyze data referring to a discrete space (lattice or areal data regarding administrative divisions), and the boundaries cannot be ignored, given that economic conditions can change abruptly due to changes in the tax system, in transportation costs, or to the impact of public policies at regional and sectorial levels. The work underlying this report handles issues related to a discrete space, i.e. a space partitioned into a finite number of regions, along with a finite number of economic activities.
The basic data have the form $N_{ij}$, representing the number of statistical units for a region $i$ and an activity $j$. The labels $i$ of the regions are arbitrary and carry no information on either spatial contiguity or distance among regions. At a first stage, this analysis is "spaceless" and motivated by policymaking rather than by spatial diffusion issues. Indeed, the data $N_{ij}$ provide no information about the localization of primary units within a region. At a second stage, problems of agglomerations, i.e. of clusters of contiguous regions, are introduced, but these problems require additional data on the distance, or contiguity, between the regions. This topic is the object of the last section of this report.

The results of most analyses are shown in Choropleth maps, accompanied by tables and graphics that serve as a description and help in understanding the different indexes and algorithm outputs. A Choropleth map, or area-value map, is one of the most frequently used maps in geography (Robinson et al., 1995); it reveals data patterns by showing the distribution of a chosen phenomenon within the selected area. In order to construct a Choropleth map, data are aggregated into classes that are represented in the map by shades of color. The greater the density of the color, the greater the density or value represented. While such generalization may hide some details, it allows a quicker observation of patterns and variation, and provides a basis for posing analytical questions.
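To make the aggregation into classes concrete, the following is a minimal sketch of one common way to choose class breaks: pick the split of the sorted values that minimizes the total within-class sum of squared deviations. The function names `sse` and `natural_breaks` and the sample values are hypothetical, and the exhaustive search is only practical for small inputs; it is an illustration of the idea, not the procedure used to produce the maps of this project.

```python
from itertools import combinations

def sse(xs):
    """Sum of squared deviations of xs from their mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)

def natural_breaks(values, k):
    """Split the sorted values into k contiguous classes with minimal
    total within-class sum of squares (exhaustive search, small inputs only)."""
    xs = sorted(values)
    best = None
    # Choose k-1 cut positions between consecutive sorted values.
    for cuts in combinations(range(1, len(xs)), k - 1):
        bounds = (0, *cuts, len(xs))
        classes = [xs[a:b] for a, b in zip(bounds, bounds[1:])]
        cost = sum(sse(c) for c in classes)
        if best is None or cost < best[0]:
            best = (cost, classes)
    return best[1]

# Hypothetical indicator values for eight regions, grouped into three shades.
classes = natural_breaks([50, 1, 11, 2, 12, 3, 51, 10], 3)
```

Each resulting class would then be mapped to one shade of color, darker shades standing for higher values.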
3 Regional manufacturing indicators

The evaluation of regional characteristics and performances can be facilitated by reference to different indices. In the sequel, we shall use in particular the following ones.

The data classification method adopted to construct the Choropleth maps is the Fisher-Jenks optimization method, also called Jenks Natural Breaks (Jenks, 1967). Jenks' method is the one-dimensional version of K-means clustering (Forgy, 1965). This data clustering method calculates groupings of data values based on the data distribution, seeking to reduce the variance within groups and maximize the variance between groups. The advantage of this classification is that it identifies real classes within the data. For more details about cluster analysis see Hartigan (1975), Kaufman and Rousseeuw (2005), Gan et al. (2007), Everitt et al. (2010), Aggarwal and Reddy (2013), among many others.

3.1 Specialization and concentration: basic concepts

3.1.1 General approach

The practitioner may like to be reminded that the analysis of specialization and of concentration raises, at the conceptual level, two different issues to be carefully distinguished.

One issue consists in characterizing the dispersion, or concentration, of distributions on unordered categorical variables such as regions or activities. In the theory of information it is established that the distribution of maximal dispersion, or minimal concentration, is the uniform distribution; for instance, on the regions it would be $P(i) = 1/I$, $i = 1, 2, \dots, I$. A measure of dispersion may accordingly be obtained through a measure of the discrepancy (i.e. a distance or a divergence, see below) between the uniform distribution, taken as a benchmark of maximal dispersion, and the distribution of interest. A possible alternative approach is to introduce an order, "natural" or artificial, among the categories and to rely on methods based on the Lorenz curve and Gini coefficients.
In most cases, however, the ordering is defined only relative to the distribution to be analyzed and is therefore not intrinsic to the set of possible categories. In this book, we generally give preference to the first approach.

Another issue is to focus the attention on one among three possible levels of analysis. A first level, to be called an absolute concept (of specialization or of concentration), analyzes the dispersion of a distribution in itself without reference to the contingency with another variable. We then obtain absolute concepts of specialization or of concentration, either for a whole country or for a specific region $i$ or a specific activity $j$. A second level, to be called a relative concept (of specialization or of concentration), confronts the conditional distribution of the activities within a region with the marginal distribution of the activities for the country, taken as a benchmark of no relative specialization, or the conditional distribution of the regions for a given activity with the marginal distribution of the regions for the country, taken as a benchmark of no relative concentration. A third level, to be called a global concept, confronts the actual bivariate distribution, on regions × activities, with the product of their marginal distributions that represents
Perroux, F. (1950), Economic space: theory and applications. Quarterly Journal of Economics 64: 89-104.
Robinson, A.H., Morrison, J.L., Muehrcke, P.C., Kimerling, A.J., and Guptill, S.C. (1995), Elements of Cartography. New York: John Wiley & Sons.
Rota, G.-C. (1964), The number of partitions of a set. American Mathematical Monthly 71: 498-504.
Sloane, N.J.A. (2001), Bell numbers. In Hazewinkel, M. (Ed.), Encyclopedia of Mathematics. New York: Springer.
Thompson, J.F., Soni, B.K., and Weatherill, N.P. (eds.) (1999), Handbook of Grid Generation. Boca Raton, Florida: CRC Press.
Ward, J.H., Jr. (1963), Hierarchical grouping to optimize an objective function. Journal of the American Statistical Association 58: 236-244.
Zoellner, I., and Schmidtmann, I. (1999), Empirical studies of cluster detection: different cluster tests in application to German cancer maps. In A. Lawson, A. Biggeri, D. Böhning, E. Lesaffre and J.-F. Viel (eds.), Disease Mapping and Risk Assessment for Public Health. Chichester: John Wiley.
Branson, D. (2000), Stirling numbers and Bell numbers: their role in combinatorics and probability. Mathematical Scientist 25: 1-31.
Donato, V. (2002), Políticas públicas y localización industrial en Argentina. CIDETI Working Paper 2002/01, Alma Mater Studiorum-Università di Bologna in the Argentine Republic.
Duranton, G., and Overman, H. (2005), Testing for localization using micro-geographic data. Review of Economic Studies 72: 1077-1106.
Everitt, B., Landau, S., Leese, M., and Stahl, D. (2010), Cluster Analysis. Chichester: John Wiley & Sons.
Florence, P. (1939), Report of the Location of Industry. Political and Economic Planning, London, UK.
Forgy, E.W. (1965), Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21: 768-769.
Gan, G., Ma, C., and Wu, J. (2007), Data Clustering: Theory, Algorithms, and Applications. Philadelphia: ASA-SIAM Series on Statistics and Applied Probability.
Gardner, M. (1978), The Bells: versatile numbers that can count partitions of a set, primes and even rhymes. Scientific American 238: 24-30.
Good, P. (2000), Permutation Tests: A Practical Guide to Resampling Methods for Testing Hypotheses. New York: Springer-Verlag.
Greenacre, M.J. (2007), Correspondence Analysis in Practice. Boca Raton, Florida: Chapman & Hall/CRC.
Greenacre, M.J. (2011), A simple permutation test for clusteredness. Working Paper 555, Barcelona GSE Working Paper Series.
Guimarães, P., Figueiredo, O., and Woodward, D. (2003), A tractable approach to the firm location decision problem. Review of Economics and Statistics 84: 201-204.
Guimarães, P., Figueiredo, O., and Woodward, D. (2009), Dartboard tests for the location quotient. Regional Science and Urban Economics 39: 360-364.
Haedo, C. (2009), Measure of Global Specialization and Spatial Clustering for the Identification of "Specialized" Agglomeration. Ph.D. thesis, Bologna: Dipartimento di Scienze Statistiche "P. Fortunati", Università di Bologna (I).
The final bootstrap step of AgA has been used to analyze in more depth the role of contiguity in the formation of agglomerations, in particular for evaluating the presence of intra- or inter-regional localization economies. The agglomerations identified, for a given activity, by AgA have been compared in terms of:
(i) the size of firms;
(ii) the distribution of activities within the agglomeration;
(iii) the intensity of the specialization of the agglomeration, measured by the average of the location quotients $LQ_{ij}$ of the agglomeration, by the global $LQ_{ij}$ of the agglomeration, or by the log of the location quotient weighted by the corresponding $N_{ij}$;
(iv) the other activities in which the agglomeration is also specialized.

Given that the set of activities is the same for every country, BcA allows one to compare the best collapsed tables for two countries following one of two complementary strategies. A first one considers every g-region with all activities and computes the discrepancy between the activity distributions for every pair of g-regions. A second strategy is based on an ordering of the activities for each g-region and evaluates a rank correlation for the activities of every pair of g-regions. The ordering of the activities may be based either on the weight of the activities or on the intensity of the specialization of each activity within each g-region.

References

Aggarwal, C., and Reddy, C. (eds.) (2013), Data Clustering: Algorithms and Applications. Boca Raton, Florida: Chapman & Hall/CRC.
Anas, A., Arnott, R., and Small, K. (1998), Urban spatial structure. Journal of Economic Literature 36: 1426-1464.
Arbia, G. (1989), Spatial Data Configuration in Statistical Analysis of Regional Economic and Related Problems. Dordrecht: Kluwer.
Arbia, G. (2001), The role of spatial effects in the empirical analysis of regional concentration. Journal of Geographical Systems 3: 271-281.
Arbia, G., and Espa, G. (1996), Statistica Economica Territoriale. Padova: CEDAM.
Arbia, G., Espa, G., and Quah, D. (2008), A class of spatial econometric methods in the empirical analysis of clusters of firms in the space. Empirical Economics 34: 81-103.
Bertin, J. (2010), Semiology of Graphics: Diagrams, Networks, Maps. Redlands, CA: Esri Press.
Besag, J., and Newell, J. (1991), The detection of clusters in rare diseases. Journal of the Royal Statistical Society 154: 143-155.
Clearly:

$\sum_{u \in \mathcal{U}} x_{u,ij} = N_{ij}$ (9)

Cartographic data

Country maps cover three levels of administrative and political boundaries: national administrative boundaries and first and second levels of subnational administrative boundaries (for more details about the administrative and political boundaries of each country used in this project, see "Fuentes de datos/Data sources" in the bar menu of the project's website).

Manufacturing activities

The manufacturing activities for each country have been homologated in 21 divisions (Table 1) of the International Standard Industrial Classification of All Economic Activities (ISIC) Revision 4 of the United Nations ( http://unstats.un.org/unsd/cr/registry/regcst.asp?Cl=17&Lg=1) as follows:

Table 1: Homologated manufacturing activities

Divisions (ISIC Rev.4)   Description
10, 11                   Manufacture of food products; Manufacture of beverages
12                       Manufacture of tobacco products
13                       Manufacture of textiles
14                       Manufacture of wearing apparel
15                       Manufacture of leather and related products
16                       Manufacture of wood and of products of wood and cork, except furniture; manufacture of articles of straw and plaiting materials
17                       Manufacture of paper and paper products
18                       Printing and reproduction of recorded media
19                       Manufacture of coke and refined petroleum products
20, 21                   Manufacture of chemicals and chemical products; Manufacture of basic pharmaceutical products and pharmaceutical preparations
22                       Manufacture of rubber and plastics products
23                       Manufacture of other non-metallic mineral products
24                       Manufacture of basic metals
25                       Manufacture of fabricated metal products, except machinery and equipment
26                       Manufacture of computer, electronic and optical products
27                       Manufacture of electrical equipment
28, 33                   Manufacture of machinery and equipment n.e.c.; Repair and installation of machinery and equipment
29                       Manufacture of motor vehicles, trailers and semi-trailers
30                       Manufacture of other transport equipment
31, 32                   Manufacture of furniture; Other manufacturing
38                       Waste collection, treatment and disposal activities; materials recovery
Haedo, C., and Mouchart, M. (2012), A stochastic independence approach for different measures of concentration and specialization. CORE Discussion Paper 2012/25, UCL Louvain-la-Neuve (B). http://www.uclouvain.be/cps/ucl/doc/core/documents/coredp2012_25web.pdf
Haedo, C., and Mouchart, M. (2014), Automatic grouping of regions and activities. Alias: Best Collapsed Tables, in progress.
Haedo, C., and Mouchart, M. (2015), Specialized Agglomerations with Lattice Data: Model and Detection. Spatial Statistics 11: 113-131.
Hartigan, J.A. (1975), Clustering Algorithms. New York: John Wiley & Sons.
Jenks, G.F. (1967), The data model concept in statistical mapping. International Yearbook of Cartography 7: 186-190.
Jobson, J. (1992), Applied Multivariate Data Analysis. Volume II: Categorical and Multivariate Methods. New York: Springer-Verlag.
Kaufman, L., and Rousseeuw, P.J. (2005), Finding Groups in Data. An Introduction to Cluster Analysis. Hoboken, New Jersey: John Wiley & Sons.
Krugman, P. (1991), Increasing returns and economic geography. Journal of Political Economy 99: 483-499.
Lawson, A. (2006), Statistical Methods in Spatial Epidemiology. New York: Wiley.
Liseikin, V.D. (2010), Grid Generation Methods. Dordrecht: Springer.
Manly, B. (1991), Randomization and Monte Carlo Methods in Biology. London: Chapman & Hall/CRC.
Mardia, K., Kent, J., and Bibby, J. (1979), Multivariate Analysis. London: Academic Press.
Moineddin, R., Beyene, J., and Boyle, E. (2003), On the location quotient confidence interval. Geographical Analysis 35: 249-256.
Mori, T., and Smith, T. (2011), An industrial agglomeration approach to central place and city size regularities. Journal of Regional Science 51: 694-731.
Nathan, M., and Overman, H. (2013), Agglomeration, clusters, and industrial policy. Oxford Review of Economic Policy 29: 383-404.
O'Donoghue, D., and Gleave, B. (2004), A note on methods for measuring industrial agglomeration. Regional Studies 38: 419-427.
O'Sullivan, D., and Unwin, D.J. (2010), Geographic Information Analysis. New Jersey: John Wiley & Sons.
It is well known that when the areas or regions are not uniform, as in the case of this project, which considers the different levels of administrative and political boundaries of each country, the Choropleth map ties the visual importance of each region to its geographic area rather than to the value of the indicator, giving sparsely populated areas great visual emphasis. This limitation can be addressed by using mesh/grid-square mapping (dividing the map into equal-sized units/squares and then coloring each one according to the data being encoded), or by using dot grid maps (overlaying a regular grid of circles, each sized according to the value of the region the circle falls into), among others (for more details about these topics see Thompson et al. 1999; Bertin 2010; Liseikin 2010). In the following stages of the project it is foreseen to incorporate the aforementioned grid methods as an optional visualization of the results.

The next section gives a (slightly formal) description of the data to be analyzed along with some notational conventions. Section 3 presents an overview of the main topics that are studied in this Atlas, namely the basic concepts around regional specialization and industrial concentration. This section is completed by introducing different indices useful for characterizing the manufacturing structure of the regions. The following section briefly explains the method used to analyze the Regional Manufacturing Structure of the countries through an algorithm that groups regions and activities simultaneously; the interested reader may find a more complete exposition in Haedo and Mouchart (2014). The last section briefly sketches a definition and a method for detecting Specialized Agglomerations (again, the interested reader may find a more complete exposition in Haedo and Mouchart (2015)) and ends with the explanation of the Manufacturing Competitiveness index.
2 Structure of the data and notation for the regions

This report treats data that refer to different countries of Latin America and the Caribbean. All these data share a similar structure. More specifically, for a given country, let us consider a finite set of administrative regions $i \in \mathcal{I} = \{1, \dots, I\}$, and a finite set of activities $j \in \mathcal{J} = \{1, \dots, J\}$. The administrative nature of the regions refers to two aspects. Firstly, the number of regions is finite. Secondly, the boundaries of the regions are designed exogenously, i.e. independently of the problem under analysis; in particular the areas of the different regions are typically quite heterogeneous. Furthermore, this administrative nature of the regions is crucial for the availability of the data. It is to be noted that the labels, $i \in \mathcal{I}$ and $j \in \mathcal{J}$, are "not informative", meaning that the labeling system is arbitrary and does not embody any information.

For each pair $(i, j) \in \mathcal{I} \times \mathcal{J}$, we observe the number $N_{ij}$ of primary units; these could typically be a number of manufacturing employees or a number of manufacturing economic units. Thus we obtain a two-way $I \times J$ contingency table $N = [N_{ij}]$ that also produces row, column and table totals denoted as follows:

$N_{i\cdot} = \sum_{j=1}^{J} N_{ij}; \quad N_{\cdot j} = \sum_{i=1}^{I} N_{ij}; \quad N_{\cdot\cdot} = \sum_{i=1}^{I} \sum_{j=1}^{J} N_{ij} = \sum_{j=1}^{J} N_{\cdot j} = \sum_{i=1}^{I} N_{i\cdot}$ (1)
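As a quick illustration of these totals, the sketch below computes the row, column and table totals of a small contingency table. The counts and the variable names (`table`, `row_totals`, `col_totals`, `grand_total`) are hypothetical and serve only to make the notation concrete.

```python
# Hypothetical 2 x 3 table: two regions (rows) and three activities (columns).
table = [
    [10, 0, 5],
    [3, 7, 0],
]

# Row totals: units in each region, summed over activities.
row_totals = [sum(row) for row in table]

# Column totals: units in each activity, summed over regions.
col_totals = [sum(col) for col in zip(*table)]

# Table total: it can be obtained from either set of marginal totals.
grand_total = sum(row_totals)
assert grand_total == sum(col_totals)
```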
Step 1. The first step chooses the region that forms a separate one-region cluster and that maximizes the value of $BIC(X_j \mid C)$. There are $I$ possible regions to choose from. For each cluster scheme under consideration, $M(C) = 2$, as the outcome of this first step is a two-cluster configuration, namely one cluster formed by the chosen region and a second cluster consisting of all the remaining regions. The second cluster is a "residual" cluster in the sense that it is not composed of homogeneous regions only, as far as specialization is concerned, but rather is to be used as a reservoir from which a new region will be extracted in the following step. This step is completed by evaluating $BIC(X_j \mid C^{[1]})$. If $BIC(X_j \mid C^{[1]}) = BIC(X_j \mid C^{[0]})$, the algorithm stops.

Step 2. The second step looks for a second region, to be chosen from the $I - 1$ remaining regions of the previous step, by maximizing the configuration criterion. Therefore, the outcome $C^{[2]}$ will depend on the cluster configuration of the previous step, $C^{[1]}$. At most three clusters ($M(C) = 3$) will have been formed: i) one cluster with the region chosen at the first step; ii) another cluster with the region chosen at the second step; and iii) a third cluster with the remaining regions. If, however, the two regions chosen at the first two steps are contiguous, then the number of clusters is two ($M(C) = 2$): one with the two chosen regions and the other one with the remaining regions. If the maximizer does not improve the BIC criterion of the previous step, the algorithm stops.

Next steps. Each following step of the algorithm proceeds according to the same structure as that of step 2.
Note that some agglomerations might be contiguous without being merged into a single one because their location quotients are too different: this is decided through the BIC criterion, which balances the effect on the divergence against the penalization for the number of parameters; see equations (78) and (83).

Stopping rule. The algorithm stops when no choice of region from the residual cluster of the preceding step provides an increase in the criterion. If we write $k^{*}$ for the last step before stopping the algorithm, we simplify the notation by writing $C^{*}$ instead of $C^{[k^{*}]}$. As such, this stopping rule provides a "myopic" algorithm that stops as soon as the objective decreases for the first time. The actual stopping rule of the algorithm is completed so as to protect against the possibility of local optima; more details are given in Haedo and Mouchart (2015).

Finally, the Choropleth map of each country shows the over-specialized agglomerations resulting from the optimal cluster scheme of each activity $j$ separately at the last available period.

5.5 Testing the role of contiguity for specialized agglomerations

Consider a cluster scheme that has been observed, or determined, for a given activity $j$ for which there is some degree of industrial concentration, as measured by an index of relative industrial concentration of
In the composite $Mc_{[i]}$ index, the indicator $Mp_{[i]}$ may be reclassified as follows: 2 when $Mp_{[i]} = 1$ or $Mp_{[i]} = 3$; 1 when $Mp_{[i]} = 2$ or $Mp_{[i]} = 5$; 0 when $Mp_{[i]} = 4$ or $Mp_{[i]} = 6$. Finally, every indicator of the composite $Mc_{[i]}$, generically named $X_{[i]}$, may be normalized as follows:

$N(X_{[i]}) = \dfrac{X_{[i]} - \min\{X_{[i]}\}}{\max\{X_{[i]}\} - \min\{X_{[i]}\}} \times 100$ (87)

Thus, each indicator is bounded between 0 and 100 points, while $Mc_{[i]}$ is bounded between 0 and 700 points. The Choropleth map of each country shows the values of $Mc_{[i]}$ at the last available period, aggregated into three classes (high, medium and low) using the Jenks Natural Breaks classification method. For international comparisons, $Mc_{[i]}$ has been computed for the country as a whole by simply dividing it by the number of regions $i$ with at least one economic unit.

6 A retrospective view

This Atlas gathers a set of several contributions. As a first step, data are collected for a number of countries of this continent. These data refer basically to employment and firms in the manufacturing sectors. This process is ongoing, as data for new countries are still being collected. Then, the collected data are scrutinized to evaluate the design of the collecting process and thereafter summarized by means of a set of (rather standard) descriptive measures and of illustrative maps.

This Atlas also makes original contributions in the field of detecting relatively specialized groups of regions and/or activities, for which two innovative algorithms are used. A first algorithm, to be called the Agglomerative Algorithm (AgA), constructs a partition of the regions into agglomerations, i.e. clusters of contiguous regions, characterized by a property of relative specialization with respect to a particular activity.
A second algorithm, to be called a Biclustering Algorithm (BcA), proposes a simultaneous regrouping of regions and activities with the objective of a better synthetic view of the regional manufacturing structure of an economy.

These contributions have led to a number of insightful developments. Of particular interest is comparing the spatial structure of a given activity in different countries or of different activities in a given country. AgA has been shown to be particularly well suited for comparing the spatial structure of a given activity in different countries or of different activities of a given country:
Our final objective is to construct a cluster scheme $C^{*}$ made of specialized agglomerations that maximizes the heterogeneity among agglomerations up to a penalization on the number of parameters. In this context, a set of agglomerations is "best" (or close to best) as long as the agglomerations are as different as possible in their specialization with respect to activity $j$. Here, the heterogeneity among clusters is measured by the divergence $d(p_{\tilde{m}|j} \mid g_{\tilde{m}})$, as given in equation (78). It should be noticed that a particular cluster scheme generates a particular model; indeed, from (65), a cluster scheme $C$ may be viewed as a model of agglomeration formation. Therefore, selecting a cluster scheme out of a set of possible schemes may be treated as a problem of model selection. A natural model selection procedure may be based on the Bayesian information criterion (BIC), which naturally involves the divergence (78). Thus we consider the optimization problem:

$C^{*} = \arg\max_{C} BIC(X_j \mid C)$ (82)

where, in the max operation, $C$ runs over all possible partitions of $\mathcal{I}$ into agglomerations (i.e. clusters of contiguous regions) and

$BIC(X_j \mid C) = T(X_j \mid C) - (M(C) - 1) \ln(N)$ (83)

with $T(X_j \mid C)$ as defined in (78).

A sketch of the algorithm. The number of possible cluster schemes can be enormous even for a modest number of basic regions. Thus, it is necessary to consider limited search procedures that yield reasonable approximations to best cluster schemes. Our approach is essentially an elaboration of the basic ideas of the scan method proposed by Besag and Newell (1991) in which, given the set of basic regions, we start with individual regions and progressively add contiguous regions to find the most significant cluster, evaluating all possible cluster schemes that can be formed from these regions.
This algorithm may also be viewed as belonging to the family of hierarchical divisive clustering algorithms (for an interesting synthesis of this topic, one may consult http://home.dei.polimi.it/matteucc/Clustering/tutorial_html/hierarchical.html , see also http://nlp.stanford.edu/IRbook/html/htmledition/divisiveclustering1.html ). Experience suggests that the divisive structure of this algorithm is well suited to taking due account of the constraint of contiguity.

For each activity $j$, we develop a greedy forward algorithm that uses the BIC as a selection criterion. We start with a baseline configuration, generally made of all regions with $N_{ij} > 0$; in some cases, we might also select the regions with $N_{ij}$ larger than some specified minimum. The algorithm generates, in a first (myopic) version, a sequence of configurations $C^{[k]}$ as follows:

Step 0. As an initial step, $C^{[0]}$ is the one-term partition:

$C^{[0]} = C_{\max}$ (84)

In this case, $BIC(X_j \mid C^{[0]}) = T(X_j \mid C^{[0]}) = 0$, because $M(C) = 1$.
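The role of the penalty in the criterion can be sketched in a few lines. The example below assumes that equation (83) subtracts the penalty $(M(C) - 1)\ln(N)$ from $T$, and it takes the values of $T$ as given, since their definition (equation (78)) is not reproduced here. The function names, the numbers, and the simplification that each step adds exactly one cluster are all hypothetical; this is an illustration of a myopic stopping rule, not the authors' implementation.

```python
import math

def bic(T, n_clusters, N):
    """Penalized criterion: T minus (M - 1) * ln(N), read from equation (83)."""
    return T - (n_clusters - 1) * math.log(N)

def last_improving_step(T_values, N):
    """Myopic rule: return the last step k whose BIC improves on step k - 1.

    T_values[k] is a hypothetical value of T for the scheme reached at step k,
    assumed here to contain k + 1 clusters.
    """
    best_k = 0
    for k in range(1, len(T_values)):
        if bic(T_values[k], k + 1, N) <= bic(T_values[best_k], best_k + 1, N):
            break  # no improvement: stop at the previous step
        best_k = k
    return best_k

# Step 0 has a single cluster and T = 0, so its BIC is 0.
start = bic(0.0, 1, 1000)

# Hypothetical T values for steps 0..3 with N = 1000 primary units.
k_stop = last_improving_step([0.0, 20.0, 30.0, 33.0], 1000)
```

In this toy run the gain in $T$ at step 3 is smaller than the extra penalty $\ln(1000) \approx 6.9$, so the search stops at step 2.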
the closest distribution revealing independence between regions and activities, taken as a benchmark of a completely non-concentrated, or non-specialized, country. This measure may be called a measure of "global localization" or, following Perroux (1950), of "polarization"; this book uses both terms interchangeably.

In Table 2 we have summarized, under some conventional headings, these different levels of analysis, denoting by $d(p \mid q)$ an arbitrary divergence or distance between distributions $p$ and $q$.

Table 2: Some conventional definitions

Technique                                               Measured concept
$d(p_{\cdot\tilde{j}} \mid [1/J])$                      Absolute industrial homogeneity
$d(p_{\tilde{i}\cdot} \mid [1/I])$                      Absolute regional homogeneity
$d(p_{\tilde{j}|i} \mid [1/J])$                         Absolute specialization of region $i$
$d(p_{\tilde{i}|j} \mid [1/I])$                         Absolute concentration of activity $j$
$d(p_{\tilde{j}|i} \mid p_{\cdot\tilde{j}})$            Relative specialization of region $i$
$d(p_{\tilde{i}|j} \mid p_{\tilde{i}\cdot})$            Relative concentration of activity $j$
$d([p_{ij}] \mid [p_{i\cdot} p_{\cdot j}])$             Global localization, or polarization, of the country

Absolute homogeneity refers to the spread of the (marginal) distribution of the regions $p_{\tilde{i}\cdot}$ or of the activities $p_{\cdot\tilde{j}}$. Absolute regional specialization is a feature of the distribution of activities within a region, $p_{\tilde{j}|i}$, and a region is said to be absolutely specialized if a few activities account for a large share of the region. This may be the case, for instance, when an activity is considerably larger than others at the country level. Relative regional specialization shows up when an area has a greater proportion of a particular activity than the proportion of that activity in the whole territory. In other words, relative regional specialization compares an area's share of a particular activity with the activity's share at the country level, and is accordingly measured through a discrepancy $d(p_{\tilde{j}|i} \mid p_{\cdot\tilde{j}})$, thus relative to the marginal distribution $p_{\cdot\tilde{j}}$. A similar comment is also valid for absolute and relative industrial concentration.
In order to introduce the concept of global localization, or polarization, imagine the following (artificial) experiment. Draw randomly one primary unit from the $N$ available ones and classify it by region and by activity. The probability of drawing a primary unit from the cell $(i,j)$ is evidently $p_{ij}$. Within this framework, the absence of global localization may be viewed as stochastic independence between the row and the column criteria: for instance, in every region there would be the same probability that a randomly drawn individual is active in any specific activity. Thus, global localization may be viewed as an association between the region and the activity variables. This suggests measuring the degree of global localization through a statistic that might be used for testing independence in a contingency table: this is precisely what the discrepancy $d([p_{ij}] \mid [p_i\, p_j])$ does.
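As a minimal sketch of this idea, the snippet below computes one possible choice of $d([p_{ij}] \mid [p_i\, p_j])$, namely the chi-square divergence (Pearson's statistic divided by $N$). The function name and the list-of-lists layout of the table are assumptions made for illustration only; the text does not prescribe an implementation.

```python
def independence_inertia(table):
    """Chi-square divergence between [p_ij] and [p_i p_j].

    table[i][j] holds the count N_ij (regions in rows, activities in columns).
    The result equals Pearson's chi-square statistic divided by N.
    """
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]        # N_i
    col_tot = [sum(col) for col in zip(*table)]  # N_j
    inertia = 0.0
    for i, row in enumerate(table):
        for j, n_ij in enumerate(row):
            expected = row_tot[i] * col_tot[j] / n  # N * p_i * p_j
            if expected > 0:
                inertia += (n_ij - expected) ** 2 / expected
    return inertia / n


# A table with proportional rows shows no global localization.
independent = [[10, 20], [20, 40]]
# A table where each activity sits in a single region is strongly polarized.
polarized = [[10, 0], [0, 10]]
```

A value of zero corresponds to exact row-column independence; larger values indicate a stronger association between regions and activities.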
represents how much the optimized table has gained, in inertia, relative to a table with the same cluster scheme but with randomly shuffled individuals and variables. The algorithm terminates by defining the best collapsed table, $I_{n^{*},k^{*}} \times J_{m^{*},k^{*}}$, through the solution $(n^{*}, m^{*}, k^{*})$ of the maximization problem (58), i.e. the triple $(n, m, k)$ at which this relative gain is largest, thereby balancing the trade-off between the degree of association and the dimension of the table.

Remark. In general, clustering a table involves a loss of information, measured by a decrease of the inertia. In the extreme cases, $T_{I^{(I)} J^{(J)}}$ is a $1 \times 1$ table representing the maximum level of clustering and the maximum loss of information, whereas $T_{I^{(0)} J^{(0)}}$ is an $I \times J$ table representing the original table with no loss of information. In both cases, bootstrapping is irrelevant.

Finally, the Choropleth map of each country shows the grouped regions (g-regions) obtained from the optimal collapsed table at the last available period of the data.

5 Identification of specialized agglomerations

This section develops new statistical and computational methods for the automatic detection of agglomerations displaying a spatial pattern of relative over- or under-specialization. A probability model provides a basis for partitioning space into clusters representing portions of space that are homogeneous as far as the probability of locating an economic unit is concerned. A cluster made of contiguous regions is called an agglomeration. A greedy algorithm detects specialized agglomerations through a model selection criterion. A random permutation test evaluates whether the contiguity property is significant. As a preliminary step we first present the notation for clusters of regions.

5.1 Notation for clusters of regions

Let us partition the $I$ regions into $M$ "grouped regions", called "g-regions" for ease of exposition.
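This regrouping may be written in terms of the labels:

\[ \mathcal{I} = \{1, 2, \dots, I\} = \bigcup_{m=1}^{M} \mathcal{I}_m, \qquad \mathcal{I}_m \cap \mathcal{I}_{m'} = \emptyset \ (m \neq m'), \qquad \#(\mathcal{I}_m) = I_m, \qquad \sum_{m} I_m = I \tag{59} \]

Let us define accordingly:

\[ N_m = \sum_{i \in \mathcal{I}_m} N_i, \qquad N_{m,j} = \sum_{i \in \mathcal{I}_m} N_{ij} \tag{60} \]

Using $g$ to denote relative frequencies on the space of the g-regions, we successively define:

\[ g_m = \sum_{i \in \mathcal{I}_m} p_i = \frac{N_m}{N}, \qquad g_{m \mid j} = \sum_{i \in \mathcal{I}_m} p_{i \mid j} = \frac{N_{m,j}}{N_j} \tag{61} \]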
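The aggregation in (59)-(61) can be sketched as follows. This is only an illustration under stated assumptions: the function name, the zero-based region indices and the list-of-lists table layout are hypothetical, and every activity is assumed to have a positive total $N_j$.

```python
def aggregate_regions(table, partition):
    """Collapse the rows of table (N_ij) according to a partition of the regions.

    partition is a list of lists of region indices, one list per g-region I_m.
    Returns (N_m, N_mj, g_m, g_m_given_j) following equations (60) and (61).
    """
    n_regions = len(table)
    n_cols = len(table[0])
    # Check that the blocks are disjoint and cover all regions, as in (59).
    labels = sorted(i for block in partition for i in block)
    if labels != list(range(n_regions)):
        raise ValueError("partition must cover every region exactly once")

    n = sum(sum(row) for row in table)
    col_tot = [sum(table[i][j] for i in range(n_regions)) for j in range(n_cols)]
    n_mj = [[sum(table[i][j] for i in block) for j in range(n_cols)]
            for block in partition]
    n_m = [sum(row) for row in n_mj]
    g_m = [value / n for value in n_m]
    g_m_given_j = [[n_mj[m][j] / col_tot[j] for j in range(n_cols)]
                   for m in range(len(partition))]
    return n_m, n_mj, g_m, g_m_given_j


# Three regions grouped into two g-regions: {0, 1} and {2}.
counts = [[1, 2], [3, 4], [5, 5]]
n_m, n_mj, g_m, g_m_given_j = aggregate_regions(counts, [[0, 1], [2]])
```

Each column of `g_m_given_j` sums to one, since the g-regions partition the country.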
fixed quota; ignoring the regions where $N_{ij} = 0$ actually accelerates the processing of the algorithm (selected: $N_{ij} > 4$);
  - or according to the sign of $\log LQ$: all regions, only those with $\log LQ > 0$ (overspecialized) or only those with $\log LQ < 0$ (underspecialized), possibly with 0 replaced by $+\varepsilon$ or $-\varepsilon$. Note: both criteria may be combined or not;
- specification of the total number of primary units to be taken into account in the evaluation of the criterion: either $N$ (coded as TRUE), or the sum of the $N_{ij}$ corresponding to the regions actually selected in the previous step (coded as FALSE). The case "FALSE" actually ignores the unselected regions although they are present in the country (selected: TRUE);
- parameter "sign": "sign" = TRUE when the agglomerations contain only regions with the same sign of $\log LQ$, i.e. only overspecialized or only underspecialized regions; otherwise "sign" is FALSE (selected: sign = TRUE; more explicitly: if $i_1$ and $i_2$ are aggregated into the same cluster, then the logs of the location quotients $LQ_{i_1,j}$ and $LQ_{i_2,j}$ have the same sign, i.e. both regions $i_1$ and $i_2$ are either overspecialized or underspecialized). Note: the parameter "sign" is activated at each step of the algorithm, but if, in the second parameter, the regions have been selected on the sign of $\log LQ$, the parameter "sign" is always TRUE;
- number of bootstrap replications (selected: $B = 1000$);
- criterion: BIC or AIC (selected: BIC);
- two parameters for the stopping rule:
  - firstly, a choice between stopping according to a change of sign in the trajectory of the criterion or according to the local slope of the trajectory of the criterion (selected: change of sign);
  - secondly, the width of the window within which the change of sign or the slope of the criterion is evaluated (selected: $h = 60$).
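The role of the number of bootstrap replications $B$ can be illustrated with the empirical p-value of equation (85). The sketch below is a generic permutation scheme, not the authors' code: `statistic` is a hypothetical stand-in for "run the clustering algorithm and return the optimal BIC", and the toy statistic used in the example is an assumption chosen only so that the snippet runs on its own.

```python
import random


def permutation_p_value(observed, statistic, values, n_replications=1000, seed=0):
    """Proportion of permuted statistics strictly larger than the observed one.

    values are redistributed among regions without replacement, independently
    of any contiguity structure; statistic maps a permuted list to a number.
    """
    rng = random.Random(seed)
    shuffled = list(values)
    exceed = 0
    for _ in range(n_replications):
        rng.shuffle(shuffled)  # one redistribution of the values
        if statistic(shuffled) > observed:
            exceed += 1
    return exceed / n_replications


def adjacent_product_sum(values):
    """Toy statistic (assumption): regions on a line, neighbours are adjacent."""
    return sum(a * b for a, b in zip(values, values[1:]))


lq = [0.4, 0.6, 0.9, 1.3, 1.8, 2.5]
p_value = permutation_p_value(adjacent_product_sum(lq), adjacent_product_sum,
                              lq, n_replications=200)
```

With $B$ replications the p-value can only take values that are multiples of $1/B$, which is one reason for choosing a large $B$.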
5.7 Manufacturing competitiveness

For each region $i$ of a country, a composite manufacturing competitiveness index, $Mc_{[i]}$, may be evaluated as follows:

\[ Mc_{[i]} = NMaler_{[i]} + NMqler_{[i]} + NMl_{[i]} + NMp_{[i]} + NMre_{[i]}[Meu_{ij}] + NMre_{[i]}[Mem_{ij}] + NMae_{[i]} \tag{86} \]

where $NMre_{[i]}$ is the regional specialization of region $i$ (15), based on manufacturing economic units $[Meu_{ij}]$ and on manufacturing employment $[Mem_{ij}]$, and $Mae_{[i]}$ is the number of overspecialized agglomerations $\mathcal{I}_{m \mid C}$, over all activities $j$, of which region $i$ is a part (for more details see Section 5).
Regional concentration of manufacturing activities

For each country, we have evaluated the regional concentration of each activity $j$ separately, $Mrc_{ij}$, for both kinds of primary units, manufacturing economic units, $[Meu_{ij}]$, and manufacturing employment, $[Mem_{ij}]$, as follows:

\[ Mrc_{ij} = p_{ij} \log \frac{p_{ij}}{p_i\, p_j} \tag{17} \]

For international comparisons of the regional concentration of every single activity $j$:

\[ Mrc_{[j]} = \sum_{i=1}^{I} p_{ij} \log \frac{p_{ij}}{p_i\, p_j} \tag{18} \]

The Choropleth maps of each country show the values of $Mrc_{ij}[Meu_{ij}]$ and $Mrc_{ij}[Mem_{ij}]$ for each activity $j$ separately, both at the last available period, grouping the overspecialized regions $i$ into three classes (high, medium and low) using the Jenks Natural Breaks classification method.

3.2 Further indicators

Population

For each country, the number of people, either for the country, $Pop$, or for a region $i$, $Pop_i$, is evaluated as follows:

\[ Pop = \sum_{i=1}^{I} Pop_i \tag{19} \]

The Choropleth map of each country shows the values of $Pop_i$ at the last available period, aggregated into three classes (high, medium and low) using the Jenks Natural Breaks classification method.

Relative Variation of the Population

For each country, the relative variation of the number of people, either for $Pop$ or for $Pop_i$, has been evaluated by computing its values at two different instants, namely $t_1$ and $t_2$ with $t_1 < t_2$, as follows:

\[ RVPop_{i \mid t_1,t_2} = \left( \frac{Pop_{i \mid t_2}}{Pop_{i \mid t_1}} - 1 \right) 100 \tag{20} \]

\[ RVPop_{t_1,t_2} = \sum_{i=1}^{I} \left( \frac{Pop_{i \mid t_2}}{Pop_{i \mid t_1}} - 1 \right) 100 \tag{21} \]

The Choropleth map of each country shows the values of $RVPop_{i \mid t_1,t_2}$, aggregated into three classes: increased, no changes and decreased.
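Equation (20) and the three map classes can be sketched in a few lines. The `tolerance` parameter is an assumption introduced here: the text does not say how wide the "no changes" band is, so the default of zero treats only an exactly unchanged population as "no changes".

```python
def relative_variation(value_t1, value_t2):
    """Relative variation in percent between two instants, as in equation (20)."""
    return (value_t2 / value_t1 - 1) * 100


def variation_class(rv, tolerance=0.0):
    """Map a relative variation to one of the three classes shown on the maps.

    tolerance (an assumption) is the half-width of the "no changes" band.
    """
    if rv > tolerance:
        return "increased"
    if rv < -tolerance:
        return "decreased"
    return "no changes"


# Hypothetical populations of one region at t1 and t2.
rv = relative_variation(200_000, 250_000)
label = variation_class(rv)
```

The same two functions apply unchanged to the relative variations of manufacturing economic units and of manufacturing employment.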
clusters (more on this concept in Haedo and Mouchart 2012). Note that the argument of the logarithms in (78) is the location quotient obtained after clustering the regions. Notwithstanding the existence of agglomeration economies, the location decision of economic units is modeled as independent of these agglomeration economies, because the model is essentially a static one that does not represent the dynamics of agglomeration formation. The weight $N_{m,j \mid C}$ of the logarithm of $LQ$ in (78) may be viewed as a solution to the small area problem, in line with the works of Moineddin et al. (2003), O'Donoghue and Gleave (2004) and Guimarães et al. (2003 and 2009).

In (73), $H_0$ represents $M(C) - 1$ restrictions for a fixed $j$, under the condition that the sum over $m$ is equal to 1. Therefore, under $H_0$, the test statistic $T(X_j \mid C)$ is asymptotically distributed as a chi-square distribution with $M(C) - 1$ degrees of freedom. The asymptotic p-value for this likelihood-ratio test is given by

\[ p\text{-value} = 1 - F_{M(C)-1}\big(T(X_j \mid C)\big) \tag{79} \]

where $F_{M(C)-1}$ denotes the cumulative distribution function of the chi-square distribution with $M(C) - 1$ degrees of freedom. The null hypothesis is rejected when the value of the likelihood-ratio statistic is sufficiently large or, equivalently, when the corresponding p-value is sufficiently small.

5.4 Detecting a specialized agglomerations scheme

Up to now, we have developed a "space free" analysis, in the sense that the labels of the regions are arbitrary and convey no information on the localization of the regions. We now introduce the idea of a distance-based pattern by means of the concept of agglomerations, which are clusters made of neighboring regions. The simplest case is obtained when neighboring regions are interpreted as contiguous regions. In that case, only the regrouping of contiguous regions is of interest; therefore each $\Omega_{\mathcal{I}_{m \mid C}}$ should be a connected set of regions.

The contiguity matrix (or weights matrix) $W$ formally expresses the proximity links existing between all pairs of regions. The elements of the $I \times I$ matrix $W$ are obtained as the values of the following function:

\[ w : \mathcal{I} \times \mathcal{I} \to \{0, 1\} \quad \text{where} \quad w(i_1, i_2) = 1_{\{ i_1 \text{ and } i_2 \text{ are contiguous} \}} \tag{80} \]

Note that $W$ is symmetric ($W = W'$) with 1s on the main diagonal, and that the sum of each row (or of each column) minus 1 is equal to the number of regions contiguous to the corresponding region in the set $\mathcal{I}$. Therefore, the set of regions contiguous to a cluster $\mathcal{I}_{m \mid C}$ may be written as:

\[ v(\mathcal{I}_{m \mid C}) = \{\, i_1 \in \mathcal{I} \setminus \mathcal{I}_{m \mid C} \mid \exists\, i_2 \in \mathcal{I}_{m \mid C} : w(i_1, i_2) = 1 \,\} \tag{81} \]

Remark on the $W$ matrix. The concept of contiguity underlying (80) deserves to be made more precise. Contiguity may mean that the boundaries of the two regions share at least one point, in which case $W$ is a first-order queen weights matrix, or it may mean that the two regions share a common frontier of more than one point, in which case $W$ is a first-order rook weights matrix; for more information on weights matrices see O'Sullivan and Unwin (2010).
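The definitions (80) and (81) translate directly into code. The sketch below assumes zero-based region labels and a hypothetical list of border pairs as input; how the borders are obtained (queen or rook rule) is left outside the snippet.

```python
def contiguity_matrix(n_regions, borders):
    """Build the symmetric 0/1 matrix W of (80), with 1s on the main diagonal.

    borders is an iterable of pairs (i1, i2) of contiguous regions.
    """
    w = [[1 if i == k else 0 for k in range(n_regions)] for i in range(n_regions)]
    for i1, i2 in borders:
        w[i1][i2] = 1
        w[i2][i1] = 1
    return w


def neighbours_of_cluster(w, cluster):
    """Regions outside the cluster that are contiguous to one of its members (81)."""
    members = set(cluster)
    return {i1 for i1 in range(len(w))
            if i1 not in members and any(w[i1][i2] == 1 for i2 in members)}


# Four hypothetical regions lined up along a road: 0 - 1 - 2 - 3.
w = contiguity_matrix(4, [(0, 1), (1, 2), (2, 3)])
candidates = neighbours_of_cluster(w, [0, 1])
```

In a greedy agglomeration step, only the regions returned by `neighbours_of_cluster` are candidates for being merged into the cluster, which keeps every cluster connected.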
This location quotient reveals the following feature of activity $j$ in region $i$:

\[ LQ_{ij} \begin{cases} = 1 \ \text{or} \ p_{ij} = p_i\, p_j & \text{non-specialization} \\ > 1 \ \text{or} \ p_{ij} > p_i\, p_j & \text{overspecialization} \\ < 1 \ \text{or} \ p_{ij} < p_i\, p_j & \text{underspecialization} \end{cases} \tag{14} \]

where "non-specialization" corresponds to a local contribution to the row-column independence. It should be clear from (14) that a discrepancy between the distributions $[p_{ij}]$ and $[p_i\, p_j]$ is equivalent to a discrepancy between the matrix $[LQ_{ij}]$ and a corresponding matrix of ones. Note that $LQ_{ij}$ takes values in $[0, +\infty)$ and that $p_i\, p_j > 0$. The last two equalities in (13) emphasize that specialization is an issue concerning the global structure at the country level: the absence of specialization of a cell $(i,j)$ means that, relative to the distribution in the country, activity $j$ is neither over- nor under-represented in region $i$ and that region $i$ is neither over- nor under-represented for activity $j$. Thus, "location" points to the fact that $LQ_{ij}$ is localized in the cell $(i,j)$.

Remark. Recent works, among others by Moineddin et al. (2003), O'Donoghue and Gleave (2004), Guimarães et al. (2003 and 2009) and Haedo (2009), have drawn attention to the so-called "small area problem" for the location quotient. Indeed, when a region $i$ is very small compared with the other regions of a country, the value of its location quotient may not be put on an equal footing with the location quotients of the other regions. In Section 5.3, we shall weight the log of the location quotient with the corresponding $N_{ij}$, see equation (78), with the effect of representing more adequately the impact of a small area in the characterization of the relative industrial concentration of a given activity.

Manufacturing regional specialization

For each country, we have evaluated the regional specialization of each region $i$, $Mre_{[i]}$, for both kinds of primary units: manufacturing economic units, $[Meu_{ij}]$, and manufacturing employment, $[Mem_{ij}]$.
\[ Mre_{[i]} = \sum_{j=1}^{J} p_{ij} \log \frac{p_{ij}}{p_i\, p_j} \tag{15} \]

For international comparisons of the global regional specialization, or global localization, levels:

\[ Mre = \sum_{i=1}^{I} \sum_{j=1}^{J} p_{ij} \log \frac{p_{ij}}{p_i\, p_j} \tag{16} \]

The Choropleth maps of each country show the values of $Mre_{[i]}[Meu_{ij}]$ and $Mre_{[i]}[Mem_{ij}]$, both at the last available period, grouping the overspecialized regions $i$ into three classes (high, medium and low) using the Jenks Natural Breaks classification method.
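Equations (15) and (16) can be evaluated directly from the table of counts. In the sketch below the function name and data layout are illustrative assumptions, and empty cells are given a zero contribution (the usual convention $0 \log 0 = 0$, which the text does not state explicitly).

```python
import math


def regional_specialization(table):
    """Return ([Mre_i for each region i], Mre) from counts table[i][j] = N_ij."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]        # N_i
    col_tot = [sum(col) for col in zip(*table)]  # N_j
    mre_by_region = []
    for i, row in enumerate(table):
        total = 0.0
        for j, n_ij in enumerate(row):
            if n_ij > 0:  # empty cells contribute nothing (0 log 0 = 0)
                p_ij = n_ij / n
                ratio = n_ij * n / (row_tot[i] * col_tot[j])  # p_ij / (p_i p_j)
                total += p_ij * math.log(ratio)
        mre_by_region.append(total)
    return mre_by_region, sum(mre_by_region)


# Each activity located in one region only: maximal polarization for a 2 x 2 table.
mre_i, mre = regional_specialization([[10, 0], [0, 10]])
```

Summing over regions instead of activities in the inner loop gives the regional concentration $Mrc_{[j]}$ of equation (18) in the same way.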
Remark. These concepts are based only on information contained in the contingency table. In some cases, it may be of interest to use as a benchmark a distribution relative to an exogenous variable. For instance, for distributions on the regions, such as $p_{\vec{\imath}}$ or $p_{\vec{\imath} \mid j}$, a benchmark distribution might be the distribution of the areas or of the populations of the regions. For instance, Mori and Smith (2011) have proposed that the distribution of the areas might be used as a basis for a hypothesis of a purely non-concentrated activity. Similarly, the distribution of the populations of the regions may be used, in epidemiology, as a benchmark for the non-concentration of a disease of interest.

3.1.2 An additional note on international comparisons

When comparing the regional structure of several countries, the divergences shown in Table 2 are evaluated country by country. It should be emphasized that these comparisons may crucially depend on the choice of a particular discrepancy, or dissimilarity, among distributions. This work makes reference to three discrepancies of particular interest, namely the Hellinger distance and the Kullback-Leibler and $\chi^2$ divergences. In the finite case of two distributions $q = (q_1, \dots, q_I)$ and $r = (r_1, \dots, r_I)$, these discrepancies are defined as follows:

\[ d_H^2(q_{\vec{\imath}} \mid r_{\vec{\imath}}) = \frac{1}{2} \sum_{i=1}^{I} \left( \sqrt{q_i} - \sqrt{r_i} \right)^2 \qquad \text{Hellinger distance} \tag{10} \]

\[ d_{\chi^2}(q_{\vec{\imath}} \mid r_{\vec{\imath}}) = \sum_{i=1}^{I} r_i \left( \frac{q_i}{r_i} - 1 \right)^2 \qquad \chi^2 \text{ divergence, or inertia} \tag{11} \]

\[ d_{KL}(q_{\vec{\imath}} \mid r_{\vec{\imath}}) = \sum_{i=1}^{I} q_i \log \frac{q_i}{r_i} \qquad \text{Kullback-Leibler divergence} \tag{12} \]

Experience shows that the ranking among regions or activities may be robust with respect to changes of the discrepancy in some cases but may also crucially depend on it in other cases. However, the ranking of the measures of polarization is generally robust. For some interesting examples, see Haedo and Mouchart (2012).
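The three discrepancies (10)-(12) are short enough to be written out in full. The sketch below assumes that both arguments are lists of probabilities of equal length and that every $r_i$ is strictly positive; terms with $q_i = 0$ are dropped from the Kullback-Leibler sum (convention $0 \log 0 = 0$). The function names are illustrative.

```python
import math


def hellinger_squared(q, r):
    """Squared Hellinger distance, equation (10)."""
    return 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(q, r))


def chi_square_divergence(q, r):
    """Chi-square divergence (inertia), equation (11); requires r_i > 0."""
    return sum(b * (a / b - 1) ** 2 for a, b in zip(q, r))


def kullback_leibler(q, r):
    """Kullback-Leibler divergence, equation (12); requires r_i > 0."""
    return sum(a * math.log(a / b) for a, b in zip(q, r) if a > 0)


# A fully concentrated distribution compared with the uniform benchmark [1/I].
q = [1.0, 0.0]
r = [0.5, 0.5]
values = (hellinger_squared(q, r), chi_square_divergence(q, r), kullback_leibler(q, r))
```

All three functions return zero when the two distributions coincide, but they scale differently away from zero, which is why rankings based on them need not agree.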
3.1.3 Local approach

A natural start for the analysis of relative concepts is a local one, where one examines whether a cell $(i,j)$ reveals over- or under-specialization or, equivalently, whether a cell $(i,j)$ reveals over- or under-concentration. For this purpose, the well-established Location Quotient (Florence, 1939), also known as the (estimated) Hoover-Balassa coefficient for the cell $(i,j)$, may be written in several equivalent forms as:

\[ LQ_{ij} = \frac{N_{ij}/N_i}{N_j/N} = \frac{N_{ij}/N_j}{N_i/N} = \frac{N_{ij}\, N}{N_i\, N_j} = \frac{p_{ij}}{p_i\, p_j} = \frac{p_{j \mid i}}{p_j} = \frac{p_{i \mid j}}{p_i} \tag{13} \]

The last three equalities in (13) express the same concept through proportions, i.e. independently of $N$, the number of observations.
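A minimal sketch of (13), using the third form $N_{ij} N / (N_i N_j)$: the function names and the list-of-lists layout are assumptions made for illustration, and every region and every activity is assumed to have a positive total.

```python
def location_quotients(table):
    """Return the matrix [LQ_ij] from the counts table[i][j] = N_ij."""
    n = sum(sum(row) for row in table)
    row_tot = [sum(row) for row in table]        # N_i
    col_tot = [sum(col) for col in zip(*table)]  # N_j
    return [[n_ij * n / (row_tot[i] * col_tot[j]) for j, n_ij in enumerate(row)]
            for i, row in enumerate(table)]


def specialization_label(lq):
    """Read a location quotient against the benchmark value 1."""
    if lq > 1:
        return "overspecialization"
    if lq < 1:
        return "underspecialization"
    return "non-specialization"


lq_matrix = location_quotients([[10, 0], [0, 10]])
labels = [[specialization_label(value) for value in row] for row in lq_matrix]
```

In practice an exact value of 1 is rare with real counts, so a small band around 1 may be preferable to the strict comparison used here.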
manufacturing structure in terms of relative under- and over-specialization, and activities with a similar spatial pattern in terms of relative under- or over-concentration. In short, we want to build an algorithm that automatically summarizes a possibly large regions × activities contingency table while keeping the polarization of the economy (almost) unchanged.

When the algorithm looks for regions or activities to collapse, no restriction is placed on the regions or activities to be clustered. Thus, for the regions, no criterion of contiguity, or of any distance-based pattern, is operating, because the algorithm is not looking for agglomerations in the sense of clusters of "neighboring" regions. The clusters to be elicited are of a structural nature, i.e. clusters of regions with a similar relative regional specialization pattern, or similar sectoral structure, irrespective of their geographical localization. Similarly, when collapsing activities, no consideration of intersectoral relationships, nor of value chains, is operating, because only a similar relative manufacturing concentration is at stake. Our purpose is to summarize the original information, i.e. the complete contingency table $N = [N_{ij}]$, in order to extract the most relevant patterns of specialization in the data.

The actual challenge should be kept in mind. In the case of Argentina, for instance, there are $I = 511$ regions. Using the homologated manufacturing activities of Table 1, there are $J = 21$ activities. In 2004, for example, the total number of manufacturing employees was $N = 955{,}965$. Thus the contingency table is a $511 \times 21$ matrix of 955,965 primary units spread over 10,731 cells. It should be expected that many cells have either a very small number of manufacturing employees or no employee at all.

The skeleton of the proposed algorithm may be viewed as follows.
An "optimal" grouping of regions and activities should strike a compromise between two opposite desiderata: the collapsed table should be as small as possible but should also display a minimum loss of polarization of the country. Collapsing tables means building tables of smaller dimension through aggregated regions (rows) and/or activities (columns). The total number $M$ of possible collapsed tables$^{2}$ for the $I \times J$ matrix $N$ is

\[ M = \sum_{(m_1 \dots m_i \dots m_l)} \binom{I}{m_1 \dots m_i \dots m_l} \ \sum_{(n_1 \dots n_j \dots n_k)} \binom{J}{n_1 \dots n_j \dots n_k} \tag{39} \]

where $l \le I - 1$, $k \le J - 1$, $m_1 + \dots + m_i + \dots + m_l < I$ and $n_1 + \dots + n_j + \dots + n_k < J$. For $I$ and $J$ large, as in the present case, $M$ is huge and trying all possibilities is not feasible. Therefore, we look for a greedy algorithm that only ensures a local optimum. This is obtained by means of a technique of hierarchical clustering, according to a dendrogram approach, combined with a correspondence analysis. Finally, at each step of the tree, permutation bootstrapping is used to test whether the envisaged regrouping performs better than if it had been generated randomly.

$^{2}$ Equation (39) may also be written as a product of two Bell numbers $B_n = \sum_{(m_1 \dots m_i \dots m_l)} \binom{n}{m_1 \dots m_i \dots m_l} = \sum_{0 \le k \le n} S(n,k)$, where $l \le n - 1$ and $m_1 + \dots + m_i + \dots + m_l < n$. The Bell number is the sum of the Stirling numbers of the second kind $S(n,k)$, which are equal to the number of partitions into $k$ elements of a set with $n$ members. Thus, the Bell number represents the total number of partitions of a set of $n$ elements. In equation (39), the two Bell numbers are taken at $n = I$ and at $n = J$. More details may be found in Rota (1964), Gardner (1978), Branson (2000) or Sloane (2001).
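To get a feeling for how quickly the number of partitions grows, the Bell numbers of footnote 2 can be computed from the standard recurrence $S(n,k) = k\,S(n-1,k) + S(n-1,k-1)$ for the Stirling numbers of the second kind. The function names below are illustrative, and the snippet is only a sketch of the counting argument, not part of the proposed algorithm.

```python
def stirling_second_kind_row(n):
    """Return [S(n, 0), ..., S(n, n)], the Stirling numbers of the second kind."""
    row = [1]  # S(0, 0) = 1
    for size in range(1, n + 1):
        new_row = [0] * (size + 1)
        for k in range(1, size + 1):
            previous_k = row[k] if k < len(row) else 0  # S(size-1, k)
            new_row[k] = k * previous_k + row[k - 1]
        row = new_row
    return row


def bell_number(n):
    """Total number of partitions of a set with n elements."""
    return sum(stirling_second_kind_row(n))


# Number of decimal digits of the Bell number for J = 21 activities.
digits_for_21_activities = len(str(bell_number(21)))
```

Python integers have arbitrary precision, so the same call works for $n = I = 511$, where the result has far too many digits for any exhaustive search.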
the form $d(p_{\vec{\imath} \mid j} \mid p_{\vec{\imath}})$. The question is to try to understand why the industrial concentration of activity $j$ tends to cluster into some agglomerations. As a first step, it is natural to ask whether contiguity among regions is a significant factor of clustering into over- or under-specialized agglomerations. Indeed, unlike geolocalized data, lattice data provide no information on intraregional localizations. But a major role of contiguity among regions in the formation of specialized agglomerations provides evidence that, for a specific activity, localization economies inside an agglomeration have more impact. Formally, we want to test the null hypothesis that the vector of the location quotients, for a fixed activity $j$, is invariant under the group of permutations of its coordinates.

The permutation bootstrap is a standard methodology for addressing such a question. Indeed, redistributions of the $LQ$s among all regions without replacement have been used for assessing spatial dependence between neighbouring regions, see e.g. Manly (1991), Zoellner and Schmidtmann (1999), Good (2000) and Lawson (2006). Each redistribution is simulated independently of the contiguities among regions, thus independently of the matrix $W$; if, for each simulation, we run the algorithm and compute the optimal $BIC$, then we may assess where the optimal $BIC(X_j \mid C)$ is located relative to the distribution of the simulated $BIC$. In particular, one may decide that when $BIC(X_j \mid C)$ is far in the tail of the distribution of the simulated $BIC$, it is a signal that contiguity is a significant factor of clustering. The significance of $BIC(X_j \mid C)$ is accordingly evaluated through the bootstrap distribution. More specifically, let us write $BIC_b(X_j \mid C_b)$ for the $BIC$ of the optimal cluster scheme obtained as a result of the $b$-th simulation and $\hat{F}^{B}_{BIC}$ for the empirical distribution function of the $BIC_b(X_j \mid C_b)$ obtained after $B$ simulations.
The empirical p-value of $BIC(X_j \mid C)$ is therefore:

\[ p\text{-value}[BIC(X_j \mid C)] = 1 - \hat{F}^{B}_{BIC}\big(BIC(X_j \mid C)\big) = \frac{1}{B} \sum_{1 \le b \le B} 1_{\{ BIC_b(X_j \mid C_b) > BIC(X_j \mid C) \}} \tag{85} \]

Thus the bootstrap p-value is, in general, simply the proportion of the bootstrap test statistics $BIC_b(X_j \mid C_b)$ that are more extreme than the observed test statistic $BIC(X_j \mid C)$: rejecting the null hypothesis whenever $p\text{-value}[BIC(X_j \mid C)] < \alpha$ is equivalent to rejecting it whenever $BIC(X_j \mid C)$ exceeds the $1 - \alpha$ quantile of $\hat{F}^{B}_{BIC}$.

5.6 The main parameters of the algorithm

The algorithm treats each activity $j$ independently and requires the specification of several parameters. The main parameters are:

- contiguity matrix: simple first-order queen or rook matrix (see the Remark on the $W$ matrix in Section 5.4) (selected: rook);
- selection of the regions $i$:
  - either according to $N_{ij}$: either all regions, or only those with $N_{ij} > 0$, or only those with $N_{ij} >$ some
Manufacturing level

For each region $i$ of a country, one may evaluate a manufacturing level, $Ml_{[i]}$, as the ratio of the rate of manufacturing employment in region $i$ to the corresponding rate for the country:

\[ Ml_{[i]} = \frac{Mem_i / Pop_i}{Mem / Pop} \tag{38} \]

The Choropleth map of each country shows the values of $Ml_{[i]}$ at the last available period, aggregated into three classes (high, medium and low) using the Jenks Natural Breaks classification method.

Manufacturing performance

For each region $i$ of a country, the manufacturing performance, $Mp_{[i]}$, is evaluated by computing the manufacturing level $Ml_{[i]}$ at two different instants, $t_1$ and $t_2$ with $t_1 < t_2$. The $Mp_{[i]}$ distinguishes between the following scenarios, or classes, of $Ml_{[i] \mid t_1,t_2}$:

1 = rising industrialized regions, when $Ml_{[i] \mid t_2} > Ml_{[i] \mid t_1} > 1$;
2 = declining industrialized regions, when $Ml_{[i] \mid t_1} > Ml_{[i] \mid t_2} > 1$;
3 = new industrialized regions, when $Ml_{[i] \mid t_2} > 1 > Ml_{[i] \mid t_1}$;
4 = deindustrializing regions, when $Ml_{[i] \mid t_1} > 1 > Ml_{[i] \mid t_2}$;
5 = developing industrialized regions, when $1 > Ml_{[i] \mid t_2} > 0.60$ and $Ml_{[i] \mid t_2} > Ml_{[i] \mid t_1}$;
6 = unindustrialized regions, when $0.60 > \max\{ Ml_{[i] \mid t_1}, Ml_{[i] \mid t_2} \}$,

where the threshold value 0.60 is somewhat arbitrary but aimed at indicating a movement of clear significance. The Choropleth maps of each country show the classes of $Mp_{[i]}$.

4 Regional manufacturing structure: simultaneous grouping of regions and activities

4.1 Background

The purpose of this section is to extend the usual analysis of a given contingency table in two directions. Firstly, we want to examine simultaneous groupings of regions and activities, rather than separate ones. Secondly, instead of considering arbitrarily prespecified groupings, we look for an automatic construction of groupings that are optimal according to a prespecified criterion.
From an economic geography point of view, the discrepancy $d([p_{ij}] \mid [p_i\, p_j])$, taken as a measure of polarization, or global localization, of the country, summarizes the spatial pattern of the economic activities or, equivalently, the distribution of the activities among the regions. We aim to simultaneously regroup regions with a similar
This report is mostly concerned with relative industrial concentration, i.e. the concentration of a given activity in a given region relative to other regions, and with relative regional specialization, i.e. the shares of the different activities in a given region as compared with those shares at the country level. Following Haedo and Mouchart (2012), the basic tools for these analyses are derived from the profiles provided by the contingency table $N = [N_{ij}]$; more explicitly:

- region $i$ may be characterized by the profile (or conditional distribution) of the $i$-th row$^{1}$:

\[ p_{\vec{\jmath} \mid i} = (p_{1 \mid i}, \dots, p_{j \mid i}, \dots, p_{J \mid i}), \qquad p_{j \mid i} = \frac{N_{ij}}{N_i} \tag{2} \]

  to be compared with the global row profile (or marginal distribution):

\[ p_{\vec{\jmath}} = (p_1, \dots, p_j, \dots, p_J), \qquad p_j = \frac{N_j}{N} \tag{3} \]

- similarly, activity $j$ may be characterized by the profile (or conditional distribution) of the $j$-th column:

\[ p_{\vec{\imath} \mid j} = (p_{1 \mid j}, \dots, p_{i \mid j}, \dots, p_{I \mid j}), \qquad p_{i \mid j} = \frac{N_{ij}}{N_j} \tag{4} \]

  to be compared with the global column profile (or marginal distribution):

\[ p_{\vec{\imath}} = (p_1, \dots, p_i, \dots, p_I), \qquad p_i = \frac{N_i}{N} \tag{5} \]

A high proportion of the data used in this work are presented in the form of these profiles, or conditional distributions, as a natural way of representing the regional structure of an industry or the industrial structure of a region.

For later use, it may be convenient to refer to the areas rather than to the (arbitrary) labels of the regions. Thus we also denote the country, considered as an area, as $\Omega$ and the disjoint regions as $\Omega_i$. Evidently, the regions $\Omega_i$ provide a partition of the country $\Omega$:

\[ \Omega_i \neq \emptyset, \qquad \Omega_i \cap \Omega_{i'} = \emptyset \ (i \neq i'), \qquad \bigcup_{i=1}^{I} \Omega_i = \Omega \tag{6} \]

We accordingly write:

\[ p_i = p(\Omega_i) \tag{7} \]

The analysis may be conducted in terms of the $N$ primary units labeled by $u$, i.e. $u \in \mathcal{U}$ with $\#(\mathcal{U}) = N$, along with a localization function $\ell : \mathcal{U} \to \Omega$, where $\ell(u)$ stands for the localization of $u$ in $\Omega$, and an activity function $a : \mathcal{U} \to \mathcal{J}$, where $a(u)$ stands for the activity of the primary unit $u$. For each pair $(i,j)$, the primary unit $u$ is associated with a binary variable:

\[ x^{u}_{ij} = 1_{\{ \ell(u) \in \Omega_i,\ a(u) = j \}} \tag{8} \]

$^{1}$ When the components of a vector are indexed by $i$ (regions) or by $j$ (activities), we use an arrow above the index that defines the components of the vector.
Manufacturing economic units

For each country and for each pair $(i,j) \in \mathcal{I} \times \mathcal{J}$, one may evaluate the number $Meu_{ij}$ of manufacturing economic units. Thus we obtain a two-way $I \times J$ contingency table $N = [Meu_{ij}]$ that also produces row, column and table totals denoted as follows:

\[ Meu_i = \sum_{j=1}^{J} Meu_{ij} \tag{22} \]

\[ Meu_j = \sum_{i=1}^{I} Meu_{ij} \tag{23} \]

\[ Meu = \sum_{i=1}^{I} \sum_{j=1}^{J} Meu_{ij} = \sum_{i=1}^{I} Meu_i = \sum_{j=1}^{J} Meu_j \tag{24} \]

The Choropleth maps of each country show the values of $Meu_i$ and of $Meu_j$ separately for each activity $j$, both at the last available period, aggregated into three classes (high, medium and low) using the Jenks Natural Breaks classification method.

Relative Variation of the Manufacturing economic units

For each country, the relative variation of the manufacturing economic units has been evaluated by computing $N = [Meu_{ij}]$ at two different instants, $t_1$ and $t_2$ with $t_1 < t_2$, as follows:

\[ RVMeu_{ij \mid t_1,t_2} = \left( \frac{Meu_{ij \mid t_2}}{Meu_{ij \mid t_1}} - 1 \right) 100 \tag{25} \]

\[ RVMeu_{[i] \mid t_1,t_2} = \sum_{j=1}^{J} \left( \frac{Meu_{ij \mid t_2}}{Meu_{ij \mid t_1}} - 1 \right) 100 \tag{26} \]

\[ RVMeu_{[j] \mid t_1,t_2} = \sum_{i=1}^{I} \left( \frac{Meu_{ij \mid t_2}}{Meu_{ij \mid t_1}} - 1 \right) 100 \tag{27} \]

\[ RVMeu_{t_1,t_2} = \sum_{i=1}^{I} \sum_{j=1}^{J} \left( \frac{Meu_{ij \mid t_2}}{Meu_{ij \mid t_1}} - 1 \right) 100 \tag{28} \]

The Choropleth maps of each country show the values of $RVMeu_{[i] \mid t_1,t_2}$ and of $RVMeu_{[j] \mid t_1,t_2}$ separately for each activity $j$, aggregated into three classes: increased, no changes and decreased.

Manufacturing availability of local enterprise resources

For each region $i$ of a country, one may evaluate the manufacturing availability of local enterprise resources, $Maler_{[i]}$, as the ratio between the manufacturing economic units, $Meu_i$, and the population, $Pop_i$:

\[ Maler_{[i]} = \frac{Meu_i}{Pop_i} \tag{29} \]

The Choropleth map of each country shows the values of $Maler_{[i]}$ at the last available period, aggregated into three classes (high, medium and low) using the Jenks Natural Breaks classification method.
Manufacturing employment

For each country and for each pair $(i,j) \in \mathcal{I} \times \mathcal{J}$, we observe the number $Mem_{ij}$ of manufacturing jobs held. Thus we obtain a two-way $I \times J$ contingency table $N = [Mem_{ij}]$ that also produces row, column and table totals denoted as follows:

\[ Mem_i = \sum_{j=1}^{J} Mem_{ij} \tag{30} \]

\[ Mem_j = \sum_{i=1}^{I} Mem_{ij} \tag{31} \]

\[ Mem = \sum_{i=1}^{I} \sum_{j=1}^{J} Mem_{ij} = \sum_{i=1}^{I} Mem_i = \sum_{j=1}^{J} Mem_j \tag{32} \]

The Choropleth maps of each country show the values of $Mem_i$ and of $Mem_j$ separately for each activity $j$, both at the last available period, aggregated into three classes (high, medium and low) using the Jenks Natural Breaks classification method.

Relative Variation of the Manufacturing employment

For each country, the relative variation of the manufacturing jobs held has been calculated by computing $N = [Mem_{ij}]$ at two different instants, $t_1$ and $t_2$ with $t_1 < t_2$, as follows:

\[ RVMem_{ij \mid t_1,t_2} = \left( \frac{Mem_{ij \mid t_2}}{Mem_{ij \mid t_1}} - 1 \right) 100 \tag{33} \]

\[ RVMem_{[i] \mid t_1,t_2} = \sum_{j=1}^{J} \left( \frac{Mem_{ij \mid t_2}}{Mem_{ij \mid t_1}} - 1 \right) 100 \tag{34} \]

\[ RVMem_{[j] \mid t_1,t_2} = \sum_{i=1}^{I} \left( \frac{Mem_{ij \mid t_2}}{Mem_{ij \mid t_1}} - 1 \right) 100 \tag{35} \]

\[ RVMem_{t_1,t_2} = \sum_{i=1}^{I} \sum_{j=1}^{J} \left( \frac{Mem_{ij \mid t_2}}{Mem_{ij \mid t_1}} - 1 \right) 100 \tag{36} \]

The Choropleth maps of each country show the values of $RVMem_{[i] \mid t_1,t_2}$ and of $RVMem_{[j] \mid t_1,t_2}$ separately for each activity $j$, aggregated into three classes: increased, no changes and decreased.

Manufacturing quality of local enterprise resources

For each region $i$ of a country, one may evaluate the manufacturing quality of local enterprise resources, $Mqler_{[i]}$, as the ratio between the manufacturing employment, $Mem_i$, and the manufacturing economic units, $Meu_i$:

\[ Mqler_{[i]} = \frac{Mem_i}{Meu_i} \tag{37} \]

The Choropleth map of each country shows the values of $Mqler_{[i]}$ at the last available period, aggregated into three classes (high, medium and low) using the Jenks Natural Breaks classification method.
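The ratio (37) only needs the row totals (30) of the employment table and the corresponding row totals of the economic units table. The sketch below uses hypothetical function names and a list-of-lists layout; returning `None` for a region without any economic unit is an assumption, since the text does not say how such regions are handled.

```python
def row_totals(table):
    """Row totals of a regions x activities table, as in equation (30)."""
    return [sum(row) for row in table]


def manufacturing_quality(mem_table, meu_table):
    """Mqler for each region: jobs per economic unit, equation (37).

    Returns None for regions with no economic unit (an assumption).
    """
    mem_i = row_totals(mem_table)
    meu_i = row_totals(meu_table)
    return [jobs / units if units > 0 else None
            for jobs, units in zip(mem_i, meu_i)]


# Two hypothetical regions and two activities.
mem = [[10, 20], [5, 5]]   # manufacturing jobs held
meu = [[1, 2], [0, 5]]     # manufacturing economic units
mqler = manufacturing_quality(mem, meu)
```

The availability index $Maler_{[i]}$ follows the same pattern, with the population of each region in the denominator.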
The concept of specialized agglomeration is to be built progressively. As a first step, we use the following hypothesis, maintained throughout this work:

\[ H_m : \ \pi_{i \mid j,m;C} = \pi_{i \mid m;C} \quad \text{with} \quad \sum_{i \in \mathcal{I}_{m \mid C}} \pi_{i \mid m;C} = 1 \tag{67} \]

This hypothesis of conditional independence, namely $i \perp\!\!\!\perp j \mid m; C$, means that once an individual has selected an activity $j$, the individual selects a cluster $m$ likely to be suitable for activity $j$; then, when selecting a region within the cluster $m$, conditionally on the choice $(j,m)$, the individual considers that, within $\Omega_{\mathcal{I}_{m \mid C}}$, the regions exert an attraction that is independent of the sector of activity. Moreover, because:

\[ \pi_{i \mid m;C} = \frac{\pi_i}{\pi_{m \mid C}}, \tag{68} \]

the maintained hypothesis also assumes that the attractiveness of region $i$, within cluster $m$, depends only on its general (or marginal) size $\pi_i$, relative to the cluster size $\pi_{m \mid C}$. Thus in (68) all the activities are taken into account through $\pi_i = \sum_j \pi_{ij}$; in other words, the role of $\pi_i$ in $\pi_{i \mid m;C}$ is to provide a proxy for the set of characteristics of region $i$ that are favorable to the development of an activity in general. Under our maintained hypothesis we have:

\[ \pi_{i,m \mid j;C} = \pi_{m \mid j;C}\ \pi_{i \mid m;C} \qquad \forall\, i \in \mathcal{I}_{m \mid C} \tag{69} \]

The algorithm, to be sketched in Section 5.4, provides some flexibility to adjust the regions to be taken into account when modeling a particular activity $j$; thus, for a given $j$, it may be specified that only the regions with $N_{ij} > 0$, or only the regions with $N_{ij}$ larger than some prespecified limit, will be the object of modeling. For a particular activity $j$ we eventually have $I_j$ regions to be taken into account, and we define an $A_j \times I_j$ matrix $X_j = [x^{u}_{ij}]$ where $i = 1, \dots, I_j$ indexes the columns and $u$, indexing the rows, runs over the set $\mathcal{A}_j$ of the economic units entering the relevant regions for the analysis of activity $j$, with $A_j = \#(\mathcal{A}_j)$. As $X_j$ is an incidence matrix with elements equal to $x^{u}_{ij}$ for $u \in \mathcal{A}_j$, the sum of each row is equal to 1.

From now on, we make explicit that the number of clusters depends on the cluster scheme under consideration; thus we shall write $M(C)$ instead of $M$.
It is shown, in Haedo and Mouchart (2015), that the probability of the data matrix $X_j$ may be factorized into:

$$p(X_j \mid C) = \Big[\prod_{1\le m\le M(C)} \pi_{m|j;C}^{\,N_{m,j|C}}\Big]\,\big[\,b(X_j)\,\big] \tag{70}$$

where

$$N_{m,j|C} = \sum_{i\in\mathcal{I}_{m|C}} N_{ij} \qquad b(X_j) = \prod_{1\le m\le M(C)}\ \prod_{i\in\mathcal{I}_{m|C}} \pi_{i|m;C}^{\,N_{ij}} \tag{71}$$

As the parameter of the factor $b(X_j)$ does not depend on $j$, we may factorize the likelihood function as

$$L(X_j \mid C) = L_1(\pi_{\vec{m}|j;C} \mid X_j)\; L_2([\pi_{i|m;C}] \mid X_j) \tag{72}$$
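The factorization in (70) to (72) can be checked numerically on the log scale. The sketch below is only an illustration under stated assumptions: the function name `loglik_factors`, the argument layout and the toy numbers are hypothetical and do not come from Haedo and Mouchart (2015). It computes the log of the cluster-level factor and the log of the within-cluster factor for one activity.

```python
import numpy as np

def loglik_factors(N_j, clusters, pi_m_given_j, pi_i_given_m):
    """Logs of the two factors in (70) for a single activity j.

    N_j          : counts N_ij, one entry per region i (assumed layout)
    clusters     : clusters[i] = index m of the cluster containing region i
    pi_m_given_j : probability of each cluster m given activity j
    pi_i_given_m : probability of region i within its own cluster
    """
    N_j = np.asarray(N_j, dtype=float)
    clusters = np.asarray(clusters)
    pi_m_given_j = np.asarray(pi_m_given_j, dtype=float)
    pi_i_given_m = np.asarray(pi_i_given_m, dtype=float)
    # Cluster counts N_{m,j|C}: sum of N_ij over the regions of cluster m, as in (71).
    N_mj = np.bincount(clusters, weights=N_j)
    log_L1 = float(np.sum(N_mj * np.log(pi_m_given_j)))   # cluster-level factor
    log_b = float(np.sum(N_j * np.log(pi_i_given_m)))     # within-cluster factor b(X_j)
    return log_L1, log_b
```

The sum of the two returned values equals the log-probability obtained by multiplying, unit by unit, the cluster probability by the within-cluster region probability, which is the content of the conditional independence hypothesis.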
$$g_{\vec{m}} = (g_1, \dots, g_m, \dots, g_M) \qquad g_{\vec{m}|j} = (g_{1|j}, \dots, g_{m|j}, \dots, g_{M|j}) \tag{62}$$

$$p_{i|m} = \frac{p_{i\cdot}}{g_m}\,\mathbf{1}\{i\in\mathcal{I}_m\} = \frac{N_{i\cdot}}{N_m}\,\mathbf{1}\{i\in\mathcal{I}_m\} \qquad p_{i|j,m} = \frac{p_{i|j}}{g_{m|j}}\,\mathbf{1}\{i\in\mathcal{I}_m\} = \frac{N_{ij}}{N_{m,j}}\,\mathbf{1}\{i\in\mathcal{I}_m\} \tag{63}$$

This regrouping may also be viewed in terms of a cluster scheme of areas, $C = \{\mathcal{I}_{1|C}, \dots, \mathcal{I}_{m|C}, \dots, \mathcal{I}_{M|C}\}$, consisting of disjoint regional clusters, each cluster area being the union of the areas of its member regions and the clusters jointly covering the whole territory:

$$\overline{I}_{m|C} = \bigcup_{i\in\mathcal{I}_{m|C}} \overline{i} \quad\text{with}\quad \bigcup_{m=1}^{M} \overline{I}_{m|C} = \overline{I} \tag{64}$$

Thus, when we want to make explicit the role of a particular cluster scheme $C$, we also write, instead of $g_m$:

$$g_{m|C} = g(\overline{I}_{m|C}) = \sum_{i\in\mathcal{I}_{m|C}} p_{i\cdot}$$

5.2 A structural model

We now introduce a model aimed at representing how the data have been generated and how they may eventually be interpreted. The notation distinguishes unknown parameters, in Greek letters, from functions of the data (estimators or statistics), in Latin letters, although we also use Greek letters with a hat for estimators. We start with an arbitrary cluster scheme $C = \{\mathcal{I}_{1|C}, \dots, \mathcal{I}_{m|C}, \dots, \mathcal{I}_{M|C}\}$. The stochastic model involves three categorical random elements, namely: region ($i$), activity ($j$) and cluster ($m$). When randomly drawing an economic unit $u$ from the universe $U$, we therefore need to specify a trivariate distribution $\pi_{i,j,m}$. The process is decomposed as follows:

1. individual $u$ selects an activity $j$ according to a distribution $\pi_{\cdot j}$;
2. conditionally on $j$, individual $u$ selects a cluster $m$ according to a distribution $\pi_{m|j;C}$;
3. conditionally on $(j,m)$, individual $u$ selects a region $\overline{i}$ within cluster $\overline{I}_{m|C}$ according to a distribution $\pi_{i|j,m;C}$.

In short, we consider as structural the following decomposition:

$$\pi_{i,j,m|C} = \pi_{\cdot j}\;\pi_{m|j;C}\;\pi_{i|j,m;C} \qquad i\in\mathcal{I}_{m|C} \tag{65}$$

Notice that, from (64), we have

$$\pi_{i|j} = \frac{\pi_{i,j,\cdot}}{\pi_{\cdot,j,\cdot}} \qquad \pi_{m|j;C} = \sum_{i\in\mathcal{I}_{m|C}} \pi_{i|j} \qquad \pi_{m|C} = \sum_{i\in\mathcal{I}_{m|C}} \pi_{i\cdot} \tag{66}$$

In the next subsection we design a procedure for identifying specialized agglomerations by means of a cluster scheme $C$ that is different for each activity $j$. Therefore, we do not discuss the specification of $\pi_{\cdot j}$ and conduct the whole analysis conditionally on $j$.
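The three-step selection process can be mimicked by simulation. The sketch below is an illustration under assumptions: the function name `draw_units`, the array layouts and the toy probabilities in the usage note are hypothetical, and the region weights are simply renormalized within the selected cluster.

```python
import numpy as np

def draw_units(n_units, pi_j, pi_m_given_j, w_i_given_j, clusters, seed=0):
    """Simulate counts N_ij from the three-step selection process.

    pi_j         : length-J activity probabilities (step 1)
    pi_m_given_j : J x M cluster probabilities, one row per activity (step 2)
    w_i_given_j  : J x I nonnegative region weights, renormalized within
                   the selected cluster (step 3); an assumed parametrization
    clusters     : length-I array, clusters[i] = cluster index of region i
    """
    rng = np.random.default_rng(seed)
    pi_j = np.asarray(pi_j, dtype=float)
    pi_m_given_j = np.asarray(pi_m_given_j, dtype=float)
    w_i_given_j = np.asarray(w_i_given_j, dtype=float)
    clusters = np.asarray(clusters)
    counts = np.zeros((len(clusters), len(pi_j)), dtype=int)
    for _ in range(n_units):
        j = rng.choice(len(pi_j), p=pi_j)                          # step 1: activity
        m = rng.choice(pi_m_given_j.shape[1], p=pi_m_given_j[j])   # step 2: cluster
        members = np.flatnonzero(clusters == m)                    # regions of cluster m
        w = w_i_given_j[j, members]
        i = rng.choice(members, p=w / w.sum())                     # step 3: region
        counts[i, j] += 1
    return counts
```

For instance, with two activities, three regions grouped as `clusters = [0, 0, 1]` and `pi_m_given_j = [[1, 0], [0, 1]]`, every unit of the first activity lands in one of the first two regions and every unit of the second activity lands in the third region.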
5.3 The concept of specialized cluster in a structural model

We now want to identify specialized clusters relative to a specified activity $j$. Here a cluster $\mathcal{I}_m$ is overspecialized (resp. underspecialized) with respect to activity $j$ when the $\pi_{i|j}$'s for $i \in \mathcal{I}_m$ are significantly greater (resp. smaller) than the countrywide average $\pi_{i\cdot}$ (remember that $\pi_{i\cdot}$ is an average of the $\pi_{i|j}$'s, i.e. $\pi_{i\cdot} = \sum_j \pi_{i|j}\,\pi_{\cdot j}$); in view of (13), this is equivalent to the location quotients being significantly larger, or smaller, than 1. Under the maintained hypothesis (67), it may be seen from (70) that the identification of specialized clusters, and the construction of a cluster scheme $C$ of specialized clusters, is to be based on the properties of $\pi_{m|j;C}$. A (fully) nonspecialized cluster $\mathcal{I}_m$ is a cluster where $LQ_{ij} = 1$ for all $i \in \mathcal{I}_m$ (equivalently, $\pi_{i|j} = \pi_{i\cdot}$ or $\pi_{j|i} = \pi_{\cdot j}$). This hypothesis, extended to each cluster $m$, i.e. $LQ_{ij} = 1\ \forall\, i \in \mathcal{I}$, implies, because of (66):

$$H_0:\quad \pi^{(0)}_{m|j;C} = \pi^{(0)}_{m|C} \qquad m = 1, \dots, M \tag{73}$$

This hypothesis means that, for the activity $j$ and for the cluster scheme $C$, there is no industrial concentration over the whole country. The maximum likelihood estimate under $H_0$ is therefore:

$$\hat{\pi}^{(0)}_{m|j;C} = \hat{\pi}^{(0)}_{m|C} = \frac{N_{m|C}}{N_{\cdot\cdot}} \tag{74}$$

The absence of industrial concentration for activity $j$, underlying (74), is relative to a particular cluster scheme $C$. The estimated log-likelihood under $H_0$ is

$$\ln \hat{L}^{(0)}(X_j \mid C) = \sum_{1\le m\le M(C)} N_{m,j|C}\,\ln\frac{N_{m|C}}{N_{\cdot\cdot}} + \ln a(X_j) \tag{75}$$

where $a(X_j)$ corresponds to the term $L_2([\pi_{i|m;C}] \mid X_j)$ in (72) and gathers the terms unaffected by the null hypothesis.
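The null-hypothesis estimate (74) and the data-dependent sum in (75) are direct functions of the counts. The following is a minimal sketch, assuming that the counts are stored in an I x J array and that the cluster scheme is encoded as one integer label per region; the function name `h0_fit` and this encoding are hypothetical. It uses all regions, whereas the text allows the analysis to be restricted to a subset of regions for each activity.

```python
import numpy as np

def h0_fit(N, clusters, j):
    """Estimate (74) and the sum in (75) for activity j.

    N        : I x J array of counts N_ij
    clusters : length-I integer array, clusters[i] = cluster index m of region i
    j        : column index of the activity under study
    """
    N = np.asarray(N, dtype=float)
    clusters = np.asarray(clusters)
    M = clusters.max() + 1
    # Cluster totals over all activities and for activity j alone.
    N_m = np.array([N[clusters == m].sum() for m in range(M)])
    N_mj = np.array([N[clusters == m, j].sum() for m in range(M)])
    pi0 = N_m / N.sum()                           # estimate (74)
    loglik0 = float(np.sum(N_mj * np.log(pi0)))   # sum in (75), without ln a(X_j)
    return pi0, N_mj, loglik0
```

The term involving $a(X_j)$ is left out because it is common to the null and to any alternative, so it cancels in a likelihood comparison.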
As an alternative hypothesis $H_1$, the parameter $\pi_{m|j;C}$ is left unconstrained and may be estimated as

$$\hat{\pi}^{(1)}_{m|j;C} = \frac{N_{m,j|C}}{N_{\cdot j}} \tag{76}$$

Thus, when the alternative hypothesis is assumed for each cluster $\mathcal{I}_m$ of a cluster scheme $C$, the estimated log-likelihood under $H_1$ is

$$\ln \hat{L}^{(1)}(X_j \mid C) = \sum_{1\le m\le M(C)} N_{m,j|C}\,\ln\frac{N_{m,j|C}}{N_{\cdot j}} + \ln a(X_j) \tag{77}$$

It is shown in Haedo and Mouchart (2015) that the log-likelihood-ratio statistic to test $H_0$ against $H_1$, under $H_m$, may be written as:

$$T(X_j \mid C) = 2\Big[\sum_{1\le m\le M(C)} N_{m,j|C}\,\ln\frac{N_{m,j|C}/N_{\cdot j}}{N_{m|C}/N_{\cdot\cdot}}\Big] = 2\,N_{\cdot j}\; d(p_{\vec{m}|j} \mid g_{\vec{m}}) \tag{78}$$

where $d(\cdot \mid \cdot)$ is the (nonsymmetric) Kullback-Leibler divergence between two distributions. Therefore, the test statistic (78) may be viewed as a measure of relative industrial concentration of activity $j$ among the
Similarly, the principal coordinates for the columns (activities) are:

$$G = D_c^{-1/2}\, V D_\lambda = [g_{jk}] \quad (J\times K) \qquad g_{jk} = p_{\cdot j}^{-1/2}\,\lambda_k\, v_{jk} \tag{45}$$

where $g_{jk}$ represents the score of activity $j$ in the $k$th dimension of the factor space $\mathbb{R}^K$. Similarly, the decomposition of $G$ into its $J$-dimensional columns is denoted as $G = [g_{\vec{j}1}, \dots, g_{\vec{j}K}]$. Here also:

$$G' D_c\, G = D_\lambda^2 \quad\text{i.e.}\quad \sum_j p_{\cdot j}\, g_{jk}^2 = \lambda_k^2 \tag{46}$$

Thus, equation (46) decomposes the $k$th eigenvalue of $S'S$ according to the contribution of each activity $j$, where $I_j = p_{\cdot j}\, g_{jk}^2$ measures the contribution of activity $j$. Summarizing, the SVD of $S$ provides a decomposition of the total polarization $\phi^2$ in terms of the contributions of each factor $k$ and of the contributions of the regions $i$, respectively the activities $j$:

$$\phi^2 = \sum_i I_i = \sum_i\sum_k p_{i\cdot}\, f_{ik}^2 = \sum_j I_j = \sum_j\sum_k p_{\cdot j}\, g_{jk}^2 \tag{47}$$

For more details, see Mardia et al. (1979), Jobson (1992) and Greenacre (2007). Let us write the row and column profiles as follows:

$$D_r^{-1} P = \Big[\frac{p_{ij}}{p_{i\cdot}}\Big] = [p_{j|i}] \qquad P D_c^{-1} = \Big[\frac{p_{ij}}{p_{\cdot j}}\Big] = [p_{i|j}] \tag{48}$$

Comparing, by means of a $\chi^2$ divergence, a profile with the corresponding marginal distribution provides a measure of relative specialization of region $i$ and of relative industrial concentration of activity $j$:

$$d_{\chi^2}(p_{\vec{j}|i} \mid p_{\cdot\vec{j}}) = \sum_j \frac{(p_{j|i} - p_{\cdot j})^2}{p_{\cdot j}} = [p_{\vec{j}|i} - p_{\cdot\vec{j}}]'\, D_c^{-1}\, [p_{\vec{j}|i} - p_{\cdot\vec{j}}] = \sum_k f_{ik}^2 \tag{49}$$

$$d_{\chi^2}(p_{\vec{i}|j} \mid p_{\vec{i}\cdot}) = \sum_i \frac{(p_{i|j} - p_{i\cdot})^2}{p_{i\cdot}} = [p_{\vec{i}|j} - p_{\vec{i}\cdot}]'\, D_r^{-1}\, [p_{\vec{i}|j} - p_{\vec{i}\cdot}] = \sum_k g_{jk}^2 \tag{50}$$

Therefore, the decomposition of the total inertia as a measure of polarization, in (47), may also be written in terms of average relative concentration or specialization:

$$\phi^2 = \sum_i p_{i\cdot}\; d_{\chi^2}(p_{\vec{j}|i} \mid p_{\cdot\vec{j}}) = \sum_j p_{\cdot j}\; d_{\chi^2}(p_{\vec{i}|j} \mid p_{\vec{i}\cdot}) \tag{51}$$

Equations (42) and (51) may also be interpreted in terms of divergences between row or column profiles, or conditional distributions. More details, under a stochastic independence approach, are given in Haedo and Mouchart (2012).
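The identity (51), which expresses the total inertia as a weighted average of the profile divergences (49) and (50), is easy to verify on a small table. The sketch below is illustrative only; the function name `profile_chi2` and the toy table in the usage note are assumptions.

```python
import numpy as np

def profile_chi2(N):
    """Chi-square divergences of row and column profiles from the marginals."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                     # probability matrix
    r = P.sum(axis=1)                   # row marginals
    c = P.sum(axis=0)                   # column marginals
    row_profiles = P / r[:, None]       # row profiles, as in (48)
    col_profiles = P / c[None, :]       # column profiles, as in (48)
    # (49): one divergence per region, comparing its profile with the column marginals.
    d_rows = (((row_profiles - c) ** 2) / c).sum(axis=1)
    # (50): one divergence per activity, comparing its profile with the row marginals.
    d_cols = (((col_profiles - r[:, None]) ** 2) / r[:, None]).sum(axis=0)
    # Total inertia computed directly from the definition in (42).
    E = np.outer(r, c)
    inertia = (((P - E) ** 2) / E).sum()
    return r, c, d_rows, d_cols, inertia
```

With any table having strictly positive margins, `r @ d_rows` and `c @ d_cols` both reproduce `inertia` up to rounding error.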
4.3 Best collapsed table: the algorithm

Background. The concept of distance between regions or activities is provided by means of a "square of weighted Euclidean distances" (Greenacre 2011) among profiles.
Thus, the similarity between the profiles of two regions $i$ and $i'$, or of two activities $j$ and $j'$, is measured as follows:

$$\sum_j \frac{1}{p_{\cdot j}}\Big(\frac{p_{ij}}{p_{i\cdot}} - \frac{p_{i'j}}{p_{i'\cdot}}\Big)^2 = \sum_j \frac{1}{p_{\cdot j}}\big(p_{j|i} - p_{j|i'}\big)^2 = [p_{\vec{j}|i} - p_{\vec{j}|i'}]'\, D_c^{-1}\, [p_{\vec{j}|i} - p_{\vec{j}|i'}] \tag{52}$$

$$\sum_i \frac{1}{p_{i\cdot}}\Big(\frac{p_{ij}}{p_{\cdot j}} - \frac{p_{ij'}}{p_{\cdot j'}}\Big)^2 = \sum_i \frac{1}{p_{i\cdot}}\big(p_{i|j} - p_{i|j'}\big)^2 = [p_{\vec{i}|j} - p_{\vec{i}|j'}]'\, D_r^{-1}\, [p_{\vec{i}|j} - p_{\vec{i}|j'}] \tag{53}$$

The polarization of an economy decreases as a consequence of clustering, and this loss of information is reduced by clustering the most similar regions or activities. Thus the algorithm chooses pairs of regions $i$ and $i'$ and pairs of activities $j$ and $j'$ minimizing the measures of dissimilarity (52) and (53). Following the approach of Ward (1963), the pair of regions $(i,i')$ that gives the least decrease in inertia is identified as the pair of rows $(i,i')$ which minimizes the following measure:

$$\frac{p_{i\cdot}\,p_{i'\cdot}}{p_{i\cdot} + p_{i'\cdot}} \sum_j \frac{1}{p_{\cdot j}}\big(p_{j|i} - p_{j|i'}\big)^2 = \frac{p_{i\cdot}\,p_{i'\cdot}}{p_{i\cdot} + p_{i'\cdot}}\, [p_{\vec{j}|i} - p_{\vec{j}|i'}]'\, D_c^{-1}\, [p_{\vec{j}|i} - p_{\vec{j}|i'}] \tag{54}$$

The two selected rows are then merged by summing their frequencies, and the profile and mass are recalculated. The same measure of difference as (54) is calculated at each stage of the clustering. We operate similarly for merging two columns.

A collapsed table is characterized by two partitions: a partition $\mathcal{I}$ of the rows and a partition $\mathcal{J}$ of the columns. Thus a collapsed table is denoted as $T_{\mathcal{I}\times\mathcal{J}}$ and is obtained by merging the rows and the columns of the original table according to the relevant partitions. Hierarchical clustering, of the rows or of the columns, generates a nested sequence of $(I+1)$ partitions of the rows and $(J+1)$ partitions of the columns, with the first and the last ones being:

$$\mathcal{I}^{(0)} = \{\{1\},\{2\},\dots,\{I\}\} \qquad \mathcal{J}^{(0)} = \{\{1\},\{2\},\dots,\{J\}\} \tag{55}$$

$$\mathcal{I}^{(I)} = \{\{1,2,\dots,I\}\} \qquad \mathcal{J}^{(J)} = \{\{1,2,\dots,J\}\} \tag{56}$$

The other, non-extreme, $(I-1)$ and $(J-1)$ partitions correspond to the levels of a dendrogram.
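One merging step based on the Ward-type criterion (54) can be sketched as follows. This is a minimal illustration under assumptions, not the authors' implementation: the names `total_inertia` and `ward_row_merge` are hypothetical, the search is a plain loop over all pairs of rows, and the table is assumed to have strictly positive margins.

```python
import numpy as np
from itertools import combinations

def total_inertia(N):
    """Total inertia of a contingency table (chi-square divergence from independence)."""
    P = np.asarray(N, dtype=float)
    P = P / P.sum()
    E = np.outer(P.sum(axis=1), P.sum(axis=0))
    return float((((P - E) ** 2) / E).sum())

def ward_row_merge(N):
    """Find the pair of rows minimizing (54) and return the table after merging them."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()
    r = P.sum(axis=1)                  # row masses
    c = P.sum(axis=0)                  # column masses
    profiles = P / r[:, None]          # row profiles
    best_pair, best_cost = None, np.inf
    for i, i2 in combinations(range(N.shape[0]), 2):
        diff = profiles[i] - profiles[i2]
        cost = r[i] * r[i2] / (r[i] + r[i2]) * np.sum(diff ** 2 / c)   # criterion (54)
        if cost < best_cost:
            best_pair, best_cost = (i, i2), cost
    i, i2 = best_pair
    merged = np.delete(N, i2, axis=0)  # i < i2, so row i keeps its index
    merged[i] = N[i] + N[i2]           # merge by summing frequencies
    return best_pair, float(best_cost), merged
```

The value of the criterion for the selected pair equals the loss of inertia caused by the merge, so `total_inertia(N) - total_inertia(merged)` reproduces `best_cost` up to rounding error. Repeating the step on `merged` until one row remains yields the nested sequence of row partitions.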
In this section, we give the essentials of this algorithm. We shall use the example in the next section to provide further details on the working of the algorithm.

First step: building collapsed tables.

Work on the rows. For $k = 1, 2, \dots, K$: consider the first $k$ columns of $F$, let $F^{(k)} = (f_{\vec{i}1}, \dots, f_{\vec{i}l}, \dots, f_{\vec{i}k})$, of dimension $I \times k$, where $f_{\vec{i}l}$ represents the $l$th column of $F$, and obtain a dendrogram through a hierarchical clustering of the rows of $F^{(k)}$, corresponding to the rows of $S$, as follows. Let $\mathcal{I}^{(n,k)}$, $n = 0, \dots, I$, with $\forall\, k$: $\mathcal{I}^{(0,k)} = \mathcal{I}^{(0)}$ and $\mathcal{I}^{(I,k)} = \mathcal{I}^{(I)}$, be the nested sequence of partitions of regions, starting with $\mathcal{I}^{(0)}$ and with each following partition obtained as an optimized clustering scheme based on $\mathcal{I}^{(n-1,k)}$. Thus, $\forall\, k$, there are only $I-1$ relevant levels of the hierarchical clustering.

Work on the columns. For $k = 1, 2, \dots, K$: repeat the same with the columns of $G$, namely $G^{(k)} = (g_{\vec{j}1}, \dots, g_{\vec{j}l}, \dots, g_{\vec{j}k})$, of dimension $J \times k$, where $g_{\vec{j}l}$ represents the $l$th column of $G$, and obtain a dendrogram through a hierarchical clustering of the rows of $G^{(k)}$, corresponding to the columns of $S$, as follows. Let $\mathcal{J}^{(m,k)}$, $m = 0, \dots, J$, with $\forall\, k$: $\mathcal{J}^{(0,k)} = \mathcal{J}^{(0)}$ and $\mathcal{J}^{(J,k)} = \mathcal{J}^{(J)}$, be the nested sequence of partitions of activities, starting with $\mathcal{J}^{(0)}$ and with each following partition obtained as an optimized clustering scheme based on $\mathcal{J}^{(m-1,k)}$. Thus, $\forall\, k$, there are only $J-1$ relevant levels of the hierarchical clustering.

Building collapsed tables. For each level of the row and column dendrograms, build the $(I-1)\times(J-1)$ collapsed tables $T^{(k)}_{\mathcal{I}_{n,k}\times\mathcal{J}_{m,k}}$ and calculate the corresponding inertia $\phi^2\big(T^{(k)}_{\mathcal{I}_{n,k}\times\mathcal{J}_{m,k}}\big)$.

Second step: identifying an optimal collapsed table.

Having built the array $\mathcal{A}$ of $\#(\mathcal{A}) = (I-1)(J-1)K$ collapsed tables, the final question is: which of the collapsed tables offers the best compromise between being as small as possible and preserving the highest possible polarization (i.e. association)? Permutation bootstrapping provides a tool for a suitable compromise.

Bootstrapping. Let us consider whether a particular table $T^{(k)}_{\mathcal{I}_{n,k}\times\mathcal{J}_{m,k}}$ is "best" in the sense alluded to above. At the very least, we should check that this table is not dominated by a table obtained through a random shuffling of the labels (of rows and/or of columns) based on the same level of the dendrogram.
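The shuffling check described above can be sketched as follows. This is an illustration under assumptions rather than the authors' procedure: the names `total_inertia`, `collapse` and `shuffled_inertia_gap` are hypothetical, partitions are encoded as one integer group label per row and per column, and the number of random draws is arbitrary. Permuting the labels keeps the group sizes, so the shuffled tables have the same dimension as the table under study.

```python
import numpy as np

def total_inertia(T):
    """Total inertia of a table (chi-square divergence from independence)."""
    P = np.asarray(T, dtype=float)
    P = P / P.sum()
    E = np.outer(P.sum(axis=1), P.sum(axis=0))
    return float((((P - E) ** 2) / E).sum())

def collapse(N, row_labels, col_labels):
    """Sum the cells of N within the row groups and column groups."""
    N = np.asarray(N, dtype=float)
    row_labels = np.asarray(row_labels)
    col_labels = np.asarray(col_labels)
    R = np.eye(row_labels.max() + 1)[row_labels]   # rows x row-groups indicator matrix
    C = np.eye(col_labels.max() + 1)[col_labels]   # columns x column-groups indicator matrix
    return R.T @ N @ C

def shuffled_inertia_gap(N, row_labels, col_labels, n_draws=200, seed=0):
    """Inertia of the collapsed table and mean inertia over random label shufflings."""
    rng = np.random.default_rng(seed)
    observed = total_inertia(collapse(N, row_labels, col_labels))
    shuffled = [
        total_inertia(collapse(N, rng.permutation(row_labels), rng.permutation(col_labels)))
        for _ in range(n_draws)
    ]
    return observed, float(np.mean(shuffled))
```

A collapsed table whose inertia is well above the shuffled average retains association that is attributable to the grouping itself and not merely to the size of the table.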
The optimized tables from the dendrograms are completely identified by the three characteristics $(n,m,k)$. Here, $\mathcal{I}_{n,k}$ is a partition of $I$ with $I_n$ elements, say $\{\mathcal{I}_1, \dots, \mathcal{I}_{I_n}\}$. Let $\sigma_r$ be a permutation defined on $I$, i.e. $\sigma_r : I \to I$, bijective, and let us write $\sigma_r(\mathcal{I}_{n,k})$ for the image of the partition $\mathcal{I}_{n,k}$ transformed by $\sigma_r$. Similarly, let $\sigma_c$ be a permutation defined on $J$, with image $\sigma_c(\mathcal{J}_{m,k})$. Given $(\sigma_r, \sigma_c)$, one may define a transformed table $T^{(k)}_{\sigma_r(\mathcal{I}_{n,k})\times\sigma_c(\mathcal{J}_{m,k})}$, following the same partition scheme as the optimized table $T^{(k)}_{\mathcal{I}_{n,k}\times\mathcal{J}_{m,k}}$ but with shuffled labels, and compute a corresponding inertia $\phi^2\big(T^{(k)}_{\sigma_r(\mathcal{I}_{n,k})\times\sigma_c(\mathcal{J}_{m,k})}\big)$. Note that the transformed table has the same dimension as $T^{(k)}_{\mathcal{I}_{n,k}\times\mathcal{J}_{m,k}}$; thus their inertias are comparable. The difference $\phi^2\big(T^{(k)}_{\mathcal{I}_{n,k}\times\mathcal{J}_{m,k}}\big) - \phi^2\big(T^{(k)}_{\sigma_r(\mathcal{I}_{n,k})\times\sigma_c(\mathcal{J}_{m,k})}\big)$ is an effect of the label shufflings of the rows and of the columns. The permutation bootstrap is obtained by randomly generating the permutations $(\sigma_r, \sigma_c)$ and evaluating the average, denoted as $\mathbb{E}_B$, of the corresponding inertias. The difference

$$\Delta(n,m,k) = \phi^2\big(T^{(k)}_{\mathcal{I}_{n,k}\times\mathcal{J}_{m,k}}\big) - \mathbb{E}_B\,\phi^2\Big[T^{(k)}_{\sigma_r(\mathcal{I}_{n,k})\times\sigma_c(\mathcal{J}_{m,k})}\Big] \tag{57}$$
4.2 Singular value decomposition

Correspondence Analysis is based on a classical result in matrix theory, namely the Singular Value Decomposition (SVD). Let $P = N / N_{\cdot\cdot}$ be the probability matrix corresponding to $N$. Let $r = (p_{i\cdot})$ be the vector of row marginals and $c = (p_{\cdot j})$ the vector of column marginals. Let $D_r$ and $D_c$ be the diagonal matrices formed with the row marginals and column marginals, respectively. Let us also consider the matrix of the residuals, $R = P - rc' = [p_{ij} - p_{i\cdot}\,p_{\cdot j}]$, and the matrix of the standardized residuals:

$$S = D_r^{-1/2}\, R\, D_c^{-1/2} \qquad s_{ij} = \frac{p_{ij} - p_{i\cdot}\,p_{\cdot j}}{\sqrt{p_{i\cdot}\,p_{\cdot j}}} \tag{40}$$

An SVD of $S$ may be written as:

$$S = U D_\lambda V' \tag{41}$$

where $\lambda = (\lambda_1, \dots, \lambda_K)$ is the vector of the strictly positive singular values of $S$, organized in descending order: $\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_K > 0$ with $K = \min(I-1, J-1)$, and where $D_\lambda$ is accordingly $K \times K$; moreover $U'U = V'V = I_{(K)}$. In this decomposition, the dimension of $U$ is $I \times K$ and that of $V$ is $J \times K$.

The SVD of $S$ will be used in the following spirit. Let $D_{\lambda(m)}$ be the principal submatrix of $D_\lambda$ corresponding to the first $m$ singular values $\lambda_k$, and let $U_{(m)}$ and $V_{(m)}$ be the submatrices made of the first $m$ columns of $U$ and $V$, respectively. The least-squares rank-$m$ approximation of $S$ is obtained as $S_{(m)} = U_{(m)} D_{\lambda(m)} V'_{(m)}$ (Eckart-Young theorem). Thus the sequence $S_{(m)}$, $m = 1, \dots, K$, is a sequence of improving approximations of $S$.

The $\chi^2$ divergence between $[p_{ij}]$ and $[p_{i\cdot}\,p_{\cdot j}]$, or total inertia, may be written as:

$$d_{\chi^2}([p_{ij}] \mid [p_{i\cdot}\,p_{\cdot j}]) = \phi^2 = \sum_{i=1}^{I}\sum_{j=1}^{J} \frac{(p_{ij} - p_{i\cdot}\,p_{\cdot j})^2}{p_{i\cdot}\,p_{\cdot j}} = \operatorname{tr} S'S = \sum_{k=1}^{K} \lambda_k^2 \tag{42}$$

where $\lambda_k^2$ also represents the $k$th eigenvalue of $S'S$. This inertia, being a measure of the polarization of the country, may be viewed as a global measure of the information provided by the contingency table $P$.
The principal coordinates for the rows (regions) are:

$$F = D_r^{-1/2}\, U D_\lambda = [f_{ik}] \quad (I\times K) \qquad f_{ik} = p_{i\cdot}^{-1/2}\,\lambda_k\, u_{ik} \tag{43}$$

where $f_{ik}$ represents the score of region $i$ in the $k$th dimension of the factor space $\mathbb{R}^K$. Later on, we shall systematically use the decomposition of $F$ into its $I$-dimensional columns, denoted as $F = [f_{\vec{i}1}, \dots, f_{\vec{i}K}]$ (see footnote 3). It may be checked that:

$$F' D_r\, F = D_\lambda^2 \quad\text{i.e.}\quad \sum_{1\le i\le I} p_{i\cdot}\, f_{ik}^2 = \lambda_k^2 \tag{44}$$

Thus, equation (44) decomposes the $k$th eigenvalue of $S'S$, also the $k$th component of the inertia, according to the contribution of each region $i$, namely $I_i = p_{i\cdot}\, f_{ik}^2$.

Footnote 3: When the components of a vector are indexed by $i$ (regions) or by $j$ (activities), we use an arrow above the index that defines the components of the vector.
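The relations (40) to (44) can be verified numerically with a few lines of NumPy. The sketch below is illustrative: the function name `ca_svd` and the toy table in the usage note are assumptions, and the table is assumed to have strictly positive margins. The singular value beyond the first min(I-1, J-1) is zero up to rounding error and is discarded.

```python
import numpy as np

def ca_svd(N):
    """Standardized residuals, their SVD and the row principal coordinates."""
    N = np.asarray(N, dtype=float)
    P = N / N.sum()                                       # probability matrix
    r = P.sum(axis=1)                                     # row marginals
    c = P.sum(axis=0)                                     # column marginals
    E = np.outer(r, c)
    S = (P - E) / np.sqrt(E)                              # standardized residuals (40)
    U, lam, Vt = np.linalg.svd(S, full_matrices=False)    # S = U diag(lam) V' (41)
    K = min(N.shape[0] - 1, N.shape[1] - 1)
    U, lam, V = U[:, :K], lam[:K], Vt[:K, :].T            # keep the K leading dimensions
    F = (U * lam) / np.sqrt(r)[:, None]                   # row principal coordinates (43)
    return r, c, S, U, lam, V, F
```

For a 4 x 3 table such as `[[10, 5, 3], [4, 12, 6], [7, 3, 15], [2, 8, 4]]`, the function returns two singular values; `(S ** 2).sum()` agrees with `(lam ** 2).sum()` as in (42), and `F.T @ np.diag(r) @ F` agrees with `np.diag(lam ** 2)` as in (44).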