The Internet of Production Alliance

Data report for the Data Awards Round 1

Author: Antonio de Jesus Anaya Hernandez, DevOps eng. for the IoPA.

Author: The Internet of Production Alliance, 2023.

Data was collected by "Glyxon labs" as part of the OKW Data Awards program.

The Open Know Where (OKW) Standard is part of the Internet of Production Alliance and its members.

License: CC BY SA

Introduction

This review provides analysis and recommendations for the awardees of the OKW Data Awards.

In [1]:
import geopandas
import folium
from folium.plugins import HeatMap, MiniMap, FloatImage
import pandas as pd
import os
from datetime import datetime
from scipy.spatial import KDTree
import base64
In [2]:
filename = "threed.geojson"
In [3]:
print('Filename: \t', filename)
print('Format: \t', os.path.splitext(filename)[1].lstrip('.').upper())
# getctime is the creation time on Windows (metadata-change time on Unix)
print('Created: \t', datetime.fromtimestamp(os.path.getctime(filename)).strftime('%Y-%m-%d %H:%M:%S'))
# getsize returns the size in bytes, not KB
print('Size: \t\t', os.path.getsize(filename), 'bytes')
Filename: 	 threed.geojson
Format: 	 GEOJSON
Created: 	 2023-02-17 17:40:43
Size: 		 609052 bytes
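Note that `os.path.getsize` reports bytes, and that `os.path.getctime` is the creation time only on Windows. As a minimal sketch, the same metadata could be gathered with explicit units; the helper name `describe_file` is hypothetical and not part of the original notebook.

```python
import os
from datetime import datetime


def describe_file(path):
    """Return basic file metadata with explicit units (hypothetical helper)."""
    stat = os.stat(path)
    return {
        "filename": os.path.basename(path),
        # extension without the leading dot, e.g. "GEOJSON"
        "format": os.path.splitext(path)[1].lstrip(".").upper(),
        # st_mtime is the last-modification time on every platform
        "modified": datetime.fromtimestamp(stat.st_mtime).strftime("%Y-%m-%d %H:%M:%S"),
        "size_bytes": stat.st_size,
        "size_kib": round(stat.st_size / 1024, 1),
    }
```

For a file of 609052 bytes, `size_kib` would come out at roughly 594.8.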
In [4]:
os.environ['PROJ_LIB'] = r'C:\Users\ANAYA\anaconda3\envs\okw_data_awards\Library\share\proj'
In [5]:
data = geopandas.read_file("threed.geojson")
type(data)
Out[5]:
geopandas.geodataframe.GeoDataFrame

1

Inspecting data

In [6]:
data
Out[6]:
name styleUrl icon-opacity icon-color icon-scale icon description geometry
0 Impresión 3D Infinity Makers (ELF Maker) #icon-1899-9C27B0-nodesc 1 #9c27b0 1 https://www.gstatic.com/mapspro/images/stock/5... None POINT Z (-99.24064 19.01404 0.00000)
1 Impresión 3D y Electrónica. DIAC-3D #icon-1899-9C27B0-nodesc 1 #9c27b0 1 https://www.gstatic.com/mapspro/images/stock/5... None POINT Z (-99.24327 18.92853 0.00000)
2 IMPRESIÓN 3D Taller de diseño #icon-1899-9C27B0-nodesc 1 #9c27b0 1 https://www.gstatic.com/mapspro/images/stock/5... None POINT Z (-99.20824 18.92575 0.00000)
3 Jart Studio (sucursal) - Impresión 3D #icon-1899-9C27B0-nodesc 1 #9c27b0 1 https://www.gstatic.com/mapspro/images/stock/5... None POINT Z (-99.14459 18.87846 0.00000)
4 3DZone S.A. de C.V. #icon-1899-9C27B0-nodesc 1 #9c27b0 1 https://www.gstatic.com/mapspro/images/stock/5... None POINT Z (-99.19289 18.93332 0.00000)
... ... ... ... ... ... ... ... ...
1633 Usina Fab Lab #icon-1899-0288D1 1 #0288d1 1 https://www.gstatic.com/mapspro/images/stock/5... {'@type': 'html', 'value': 'Facebook, Twitter,... POINT Z (-51.20268 -30.03210 0.00000)
1634 Adoro Robótica Makerspace #icon-1899-0288D1 1 #0288d1 1 https://www.gstatic.com/mapspro/images/stock/5... {'@type': 'html', 'value': 'Website works.<br>... POINT Z (-43.18070 -22.94186 0.00000)
1635 MXPCB #icon-1899-0288D1-nodesc 1 #0288d1 1 https://www.gstatic.com/mapspro/images/stock/5... None POINT Z (-89.64009 21.04402 0.00000)
1636 FabLab Cuiabá-BR #icon-1899-0288D1 1 #0288d1 1 https://www.gstatic.com/mapspro/images/stock/5... {'@type': 'html', 'value': 'Facebook and websi... POINT Z (-47.43440 -23.47281 0.00000)
1637 Oficina Maker #icon-1899-0288D1 1 #0288d1 1 https://www.gstatic.com/mapspro/images/stock/5... {'@type': 'html', 'value': 'Facebook works.<br... POINT Z (-63.85365 -8.75318 0.00000)

1638 rows × 8 columns

Removing unused columns

In [7]:
data = data.drop(columns=['styleUrl','icon-opacity', 'icon-color', 'icon-scale', 'icon'])
data
Out[7]:
name description geometry
0 Impresión 3D Infinity Makers (ELF Maker) None POINT Z (-99.24064 19.01404 0.00000)
1 Impresión 3D y Electrónica. DIAC-3D None POINT Z (-99.24327 18.92853 0.00000)
2 IMPRESIÓN 3D Taller de diseño None POINT Z (-99.20824 18.92575 0.00000)
3 Jart Studio (sucursal) - Impresión 3D None POINT Z (-99.14459 18.87846 0.00000)
4 3DZone S.A. de C.V. None POINT Z (-99.19289 18.93332 0.00000)
... ... ... ...
1633 Usina Fab Lab {'@type': 'html', 'value': 'Facebook, Twitter,... POINT Z (-51.20268 -30.03210 0.00000)
1634 Adoro Robótica Makerspace {'@type': 'html', 'value': 'Website works.<br>... POINT Z (-43.18070 -22.94186 0.00000)
1635 MXPCB None POINT Z (-89.64009 21.04402 0.00000)
1636 FabLab Cuiabá-BR {'@type': 'html', 'value': 'Facebook and websi... POINT Z (-47.43440 -23.47281 0.00000)
1637 Oficina Maker {'@type': 'html', 'value': 'Facebook works.<br... POINT Z (-63.85365 -8.75318 0.00000)

1638 rows × 3 columns

Looking for useful data in column 'description':

In [8]:
# non-null descriptions (not de-duplicated, despite the variable name)
unique_desc = list(data['description'].dropna())
print(*unique_desc, sep='\n')
print('Unique data rows: ', len(unique_desc))
{'@type': 'html', 'value': 'Website, Facebook, Twitter, Instagram work. Offers training.<br><br>https://twitter.com/3dapplications<br>https://www.facebook.com/3DApplicationsBR<br>https://www.instagram.com/3dapplications/'}
{'@type': 'html', 'value': 'Facebook, Twitter, and website work.<br><br>http://www.facebook.com/3dfila<br>https://twitter.com/3DFila_Brasil<br>https://3dfila.com.br/'}
{'@type': 'html', 'value': "Website doesn't work, Facebook doesn't work, full address:\xa0 Lai Lai Center, 1º Andar, Loja 105<br>Centro<br>Alto Parana 7000<br>Paraguai"}
{'@type': 'html', 'value': "Website doesn't work"}
Websites work
Website, Twitter, Facebook work.
Does biomedical stuff. Website works. https://www.printerize3d-scv.com.br/
{'@type': 'html', 'value': "Website doesn't work, Facebook works, LinkedIn works. Offers training."}
{'@type': 'html', 'value': "Website and Instagram don't work, Facebook works.\xa0"}
{'@type': 'html', 'value': "Facebook doesn't work. Website works."}
{'@type': 'html', 'value': "Facebook and Twitter work. Instagram doesn't work."}
{'@type': 'html', 'value': 'Website works.<br><br>http://www.impressao3dfacil.com.br'}
{'@type': 'html', 'value': 'Facebook, Twitter, and website work.<br><br>https://www.facebook.com/oaloobr<br>https://twitter.com/oaloobr<br>https://www.oaloo.com.br/'}
{'@type': 'html', 'value': 'Facebook, Instagram, and website work.<br><br>https://www.facebook.com/PRINTITTRESD/<br>https://www.instagram.com/printit_3d/<br>https://www.printit3d.com.br/'}
{'@type': 'html', 'value': 'Facebook, Twitter, and website work.<br><br>https://www.facebook.com/printgreen3d<br>https://twitter.com/printgreen3d<br>http://www.printgreen3d.com.br/'}
{'@type': 'html', 'value': 'Facebook, Instagram, and website work.<br><br>https://www.facebook.com/r3dyoficial<br>https://www.instagram.com/r3dyoficial<br>https://www.r3dy.com.br'}
{'@type': 'html', 'value': 'Facebook and website work.<br><br>https://www.facebook.com/prototipagemImpressao3D<br>http://www.rgimpressao3d.com.br'}
{'@type': 'html', 'value': "Website doesn't work.<br><br>http://tdtec.com.br/"}
{'@type': 'html', 'value': 'Facebook, Twitter, and website work.<br><br>https://www.facebook.com/usinafablab/<br>https://twitter.com/UsinaFablab<br>https://www.usinafablab.com.br/'}
{'@type': 'html', 'value': 'Website works.<br><br>http://www.adororobotica.com'}
{'@type': 'html', 'value': 'Facebook and website work.<br><br>https://www.facebook.com/fablabcba?ref=hl<br>https://www.fablabs.io/labs/fablabcuiaba'}
{'@type': 'html', 'value': 'Facebook works.<br><br>https://www.facebook.com/oficinamakerpvh/'}
Unique data rows:  22
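The descriptions come in two shapes: a dict with an HTML `value`, and a plain string. Both may carry links that are lost once the column is dropped. A minimal sketch of how those links could be kept, assuming only these two shapes occur; `extract_urls` is a hypothetical helper, not part of the original notebook.

```python
import re

URL_PATTERN = re.compile(r"https?://[^\s<>\"']+")


def extract_urls(description):
    """Return the URLs found in a description value (hypothetical helper)."""
    if description is None:
        return []
    # dict-shaped entries keep their text under the 'value' key
    if isinstance(description, dict):
        text = description.get("value", "")
    else:
        text = str(description)
    # '<br>' tags separate lines in the HTML values; '<' ends a URL match
    return URL_PATTERN.findall(text)
```

Applied with `data['description'].apply(extract_urls)`, this would yield one list of links per row, empty where no description exists.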
In [9]:
data = data.drop(columns=['description'])
In [10]:
report = {'okw_columns': list(data.columns),} 
In [11]:
data
Out[11]:
name geometry
0 Impresión 3D Infinity Makers (ELF Maker) POINT Z (-99.24064 19.01404 0.00000)
1 Impresión 3D y Electrónica. DIAC-3D POINT Z (-99.24327 18.92853 0.00000)
2 IMPRESIÓN 3D Taller de diseño POINT Z (-99.20824 18.92575 0.00000)
3 Jart Studio (sucursal) - Impresión 3D POINT Z (-99.14459 18.87846 0.00000)
4 3DZone S.A. de C.V. POINT Z (-99.19289 18.93332 0.00000)
... ... ...
1633 Usina Fab Lab POINT Z (-51.20268 -30.03210 0.00000)
1634 Adoro Robótica Makerspace POINT Z (-43.18070 -22.94186 0.00000)
1635 MXPCB POINT Z (-89.64009 21.04402 0.00000)
1636 FabLab Cuiabá-BR POINT Z (-47.43440 -23.47281 0.00000)
1637 Oficina Maker POINT Z (-63.85365 -8.75318 0.00000)

1638 rows × 2 columns

In [12]:
data.name.unique().shape
Out[12]:
(1132,)
In [13]:
data.geometry.unique().shape
Out[13]:
(1178,)
In [14]:
# keep the first row for each distinct geometry
data = data.drop_duplicates(subset=['geometry'])
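Of the 1638 rows, 1178 have a distinct geometry, so 460 rows share coordinates with an earlier row. The plain-Python sketch below illustrates the keep-first behaviour and shows how the dropped rows could be listed for review rather than discarded silently. The sample records and the function name are hypothetical.

```python
def split_duplicates(records):
    """Split (name, lon, lat) records into first occurrences and repeats.

    Mirrors a keep-first de-duplication on exact coordinates.
    """
    seen = set()
    kept, dropped = [], []
    for name, lon, lat in records:
        key = (lon, lat)
        if key in seen:
            dropped.append((name, lon, lat))
        else:
            seen.add(key)
            kept.append((name, lon, lat))
    return kept, dropped


# hypothetical sample: the second record repeats the first record's coordinates
sample = [
    ("Usina Fab Lab", -51.20268, -30.03210),
    ("Usina Fab Lab (duplicate entry)", -51.20268, -30.03210),
    ("MXPCB", -89.64009, 21.04402),
]
kept, dropped = split_duplicates(sample)
```

Inspecting `dropped` makes it possible to tell true duplicates apart from different businesses that share one address.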

3

Reverse search: finding keywords with the Natural Language Toolkit

In [15]:
import nltk
nltk.download('stopwords')
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from nltk.corpus import stopwords
from nltk.stem.snowball import SnowballStemmer
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\ANAYA\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
In [16]:
# preprocess text data
data['cl_name'] = data['name'].str.lower().str.replace(r'[^\w\s]', '', regex=True).str.strip()

stop_words = set(stopwords.words('spanish'))
data['cl_name'] = data['cl_name'].apply(lambda x: ' '.join([word for word in x.split() if word not in stop_words]))

stemmer = SnowballStemmer("spanish")
data['cl_name'] = data['cl_name'].apply(lambda x: ' '.join([stemmer.stem(word) for word in x.split()]))


# vectorize text data using TF-IDF
vectorizer = TfidfVectorizer()
vec_ftrans = vectorizer.fit_transform(data['cl_name'])

# cluster common words using k-means
kmeans = KMeans(n_clusters=5, random_state=0, n_init=5)
kmeans.fit(vec_ftrans)

# add cluster labels to dataframe
data['cluster'] = kmeans.labels_

# name each cluster after the highest-weighted TF-IDF term of its centroid
terms = vectorizer.get_feature_names_out()
cluster_names = [str(terms[center.argmax()]) for center in kmeans.cluster_centers_]
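Each cluster is named after the vocabulary term with the largest weight in its centroid. The plain-Python sketch below reproduces that idea on a toy corpus. It assumes the smoothed IDF and L2 normalisation that `TfidfVectorizer` applies by default, and it tokenises on whitespace only. The corpus and helper names are illustrative.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document (smoothed IDF, L2 norm)."""
    tokenized = [doc.split() for doc in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    vectors = []
    for tokens in tokenized:
        raw = {t: count * idf[t] for t, count in Counter(tokens).items()}
        # guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in raw.values())) or 1.0
        vectors.append({t: w / norm for t, w in raw.items()})
    return vectors


def top_centroid_term(vectors):
    """Return the term with the largest summed weight across the vectors."""
    totals = Counter()
    for vec in vectors:
        totals.update(vec)
    return totals.most_common(1)[0][0]


# toy corpus; the first three names are treated as one cluster
corpus = ["3d print lab", "3d maker", "3d studio", "fab lab"]
vectors = tfidf_vectors(corpus)
cluster_name = top_centroid_term(vectors[:3])
```

Summing the weights gives the same ranking as averaging them, so the result matches the largest centroid component.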
In [17]:
import seaborn as sns

cluster_counts = data.drop(columns='geometry').groupby('cluster').size().reset_index(name='count')

# Sort by frequency
cluster_counts = cluster_counts.sort_values('count', ascending=False)

# Plot frequency of clusters
sns.set_style('ticks')
sns.barplot(x='cluster', y='count', data=cluster_counts)
Out[17]:
<AxesSubplot: xlabel='cluster', ylabel='count'>
In [18]:
cluster_counts
Out[18]:
cluster count
4 4 719
2 2 278
1 1 115
0 0 52
3 3 14

4

Reverse geocoding

In [19]:
geocodes = pd.read_csv('rg_cities1000.csv')

# Create a KDTree from the lat-lon coordinates in the geocodes DataFrame
tree = KDTree(geocodes[['lat', 'lon']])
In [20]:
def get_country_code(latlong):
    lat, lon = latlong
    _, idx = tree.query([lat, lon])
    return geocodes.iloc[idx]['cc']

def get_city(latlong):
    lat, lon = latlong
    _, idx = tree.query([lat, lon])
    return geocodes.iloc[idx]['name']
In [21]:
%%time
data['country'] = list(map(get_country_code, data['geometry'].apply(lambda geom: (geom.y, geom.x))))
CPU times: total: 15.6 ms
Wall time: 76.4 ms
In [22]:
%%time
data['city'] = list(map(get_city, data['geometry'].apply(lambda geom: (geom.y, geom.x))))
CPU times: total: 15.6 ms
Wall time: 76.4 ms
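The `KDTree` lookup measures Euclidean distance in degrees, which only approximates distance on the globe, since a degree of longitude shrinks away from the equator. As a cross-check, a brute-force great-circle lookup can be sketched as follows. The three-city gazetteer and its approximate coordinates are illustrative assumptions, not rows taken from `rg_cities1000.csv`.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# illustrative gazetteer: (city, country code, lat, lon), approximate values
GAZETTEER = [
    ("Cuernavaca", "MX", 18.92, -99.23),
    ("Mexico City", "MX", 19.43, -99.13),
    ("Porto Alegre", "BR", -30.03, -51.22),
]


def nearest_city(lat, lon, gazetteer=GAZETTEER):
    """Return the (city, country code) entry closest to the given point."""
    best = min(gazetteer, key=lambda row: haversine_km(lat, lon, row[2], row[3]))
    return best[0], best[1]
```

Spot-checking a few points this way would show whether the degree-based tree and the great-circle distance agree on the nearest city.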
In [23]:
data.drop(columns='geometry').groupby('country').size().sort_values(ascending=False)
Out[23]:
country
MX    601
BR    204
AR    125
CO     48
CL     37
BO     19
EC     19
PE     13
DO     13
PY     11
VE     11
HN     11
GT     10
PR     10
JM     10
CR      9
SV      8
UY      8
PA      4
NI      3
BZ      2
US      2
dtype: int64
In [24]:
sns.countplot(y='country', data=data.drop(columns='geometry'), order=data['country'].value_counts().index)
Out[24]:
<AxesSubplot: xlabel='count', ylabel='country'>
In [25]:
data.drop(columns='geometry').groupby('city').size().sort_values(ascending=False)
Out[25]:
city
La Paz                     19
Cancun                     19
Hermosillo                 19
Tuxtla Gutierrez           18
Xalapa de Enriquez         17
                           ..
Ejido Javier Rojo Gomez     1
Duque de Caxias             1
Paulinia                    1
Duitama                     1
Zaragoza                    1
Length: 420, dtype: int64
In [26]:
top_cities = data['city'].value_counts().nlargest(20).index
sns.countplot(y='city', data=data[data['city'].isin(top_cities)].drop(columns='geometry'), order=top_cities)
Out[26]:
<AxesSubplot: xlabel='count', ylabel='city'>
In [37]:
# bounding box of all points as (lat, lon) corners
point_max = (data.geometry.y.max(), data.geometry.x.max())
point_min = (data.geometry.y.min(), data.geometry.x.min())

main_map = folium.Map(zoom_start=2)

main_map.fit_bounds((point_min, point_max))

marker_layer = folium.FeatureGroup(name='Markers', show=False)

popup = folium.features.GeoJsonPopup(
    fields=['name', 'country'],
    aliases=['Name:', 'Country:'],
    localize=True,
    sticky=False,
    labels=True,
    style="font-size: 12px;",
)
 
geojson = folium.GeoJson(
    data=data,
    popup=popup,
    marker=folium.Marker(icon=folium.features.Icon()),
).add_to(marker_layer)

main_map.add_child(marker_layer)

heatmap_data = [[row['geometry'].y, row['geometry'].x] for index, row in data.iterrows()]
heatmap_layer = folium.FeatureGroup(name='3D print')
heatmap_layer.add_child(HeatMap(heatmap_data, opacity=0.1, radius=8))

main_map.add_child(heatmap_layer)

logo_url = 'iopa_logo_okw.png'
logo_size = (10, 10)
icon = folium.features.CustomIcon(logo_url, icon_size=logo_size)
span = 1
float_image = FloatImage(logo_url, bottom=span, left=span, width=logo_size[0], height=logo_size[1])

main_map.add_child(float_image)

minimap = MiniMap()

main_map.add_child(minimap)

folium.TileLayer('openstreetmap').add_to(main_map)

map_control = folium.LayerControl(name='Base Maps', collapsed=True)

main_map.add_child(map_control)
Out[37]:
[Interactive folium map: '3D print' heatmap layer with a toggleable 'Markers' layer; not rendered in this export]
In [38]:
# main_map
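`fit_bounds` expects the south-west and north-east corners as `[lat, lon]` pairs, whereas the point geometries store `x` as longitude and `y` as latitude. A plain-Python sketch of that corner computation follows, using three coordinates from the data table as sample input; the function name is hypothetical.

```python
def bounding_box(points):
    """Return [[min_lat, min_lon], [max_lat, max_lon]] for (lon, lat) points."""
    lons = [lon for lon, _ in points]
    lats = [lat for _, lat in points]
    return [[min(lats), min(lons)], [max(lats), max(lons)]]


# (lon, lat) pairs in the same x/y order as the POINT geometries
sample_points = [(-99.24064, 19.01404), (-51.20268, -30.03210), (-89.64009, 21.04402)]
south_west, north_east = bounding_box(sample_points)
```

Swapping the axis order is the most common mistake here: passing `(lon, lat)` corners would frame the wrong part of the world.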

Interactive map

1. Usage: click the layer selector in the top-right corner to switch the interactive markers on or off.
In [28]:
report['cities_top_20'] = data['city'].value_counts().nlargest(20)
report['countries_top_5'] = data['country'].value_counts().nlargest(5)

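def list_to_markdown(series):
    # one bullet per (label, count) pair of the value_counts Series
    pairs = list(zip(series.index, series.values))
    locations = ['\t\t . {}'.format(pair) for pair in pairs]
    # strip the tuple punctuation left over from formatting each pair
    return Markdown('\n'.join(locations).replace('(','').replace(')','').replace('[','').replace(']','').replace(',', ''))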
In [29]:
from IPython.display import Markdown

# look up each cluster's count by its label so names and counts stay aligned
counts = cluster_counts.set_index('cluster')['count']
cluster_zip = [f'{name}: {counts[label]}' for label, name in enumerate(cluster_names)]


display(Markdown(f'''
## Findings:

    1. Columns that are related to the OKW standard are:

        a. {report['okw_columns']}

    2. Completeness compared to the OKW simplified data schema:

        a. Total number of schema fields: 17,
        b. Percentage covered: {100 * len(report['okw_columns']) // 17} %.

    3. Reverse keyword analysis (reverse search) in 'Spanish' language had the results:

        a. {cluster_zip}
'''))

display(Markdown(f'''
    4. Total unique locations:

        a. {data.shape[0]}.

    5. Verified locations:

        a. Unverified locations. No references provided.
        b. An unknown number of locations may appear in Google Places/Maps.

    6. The reverse geolocation had the results of locations by top 5 countries and top 20 cities:
'''))

# render the two lists directly instead of calling display() inside an f-string
display(list_to_markdown(report['countries_top_5']))
display(list_to_markdown(report['cities_top_20']))

display(Markdown(f'''
    7. Origin of data:

        a. No raw data or origin is stated in the files. No additional documentation or methods were provided.

    8. Possible origin of data based on inspected data:

        a. Possibly the Google Places API, queried for '3D printing' or 'Impresion 3D' with a radius of 1000 m, for a specified list of locations.
        b. Other sources may have contributed data points, for example the provided file: ['Mapping_stats Brazil fazedores.xslx']

    9. Observations:

        a. The data provided contains data points without references, collection methods or verified sources.

    10. Quality:

        a. Evaluation results: Low, based on points 1, 2, 5 and 7.
'''))

Findings:

1. Columns that are related to the OKW standard are:

    a. ['name', 'geometry']

2. Completeness compared to the OKW simplified data schema:

    a. Total number of schema fields: 17,
    b. Percentage covered: 11 %.

3. Reverse keyword analysis (reverse search) in 'Spanish' language had the results:

    a. ['print: 52', 'las: 115', 'impresion: 278', 'printing: 14', '3d: 719']
4. Total unique locations:

    a. 1178.

5. Verified locations:

    a. Unverified locations. No references provided. 
    b. An unknown number of locations may appear in Google Places/Maps.

6. The reverse geolocation had the results of locations by top 5 countries and top 20 cities:
     . 'MX' 601
     . 'BR' 204
     . 'AR' 125
     . 'CO' 48
     . 'CL' 37
     . 'Hermosillo' 19
     . 'La Paz' 19
     . 'Cancun' 19
     . 'Tuxtla Gutierrez' 18
     . 'Xalapa de Enriquez' 17
     . 'Merida' 16
     . 'Belo Horizonte' 16
     . 'Veracruz' 14
     . 'Aguascalientes' 13
     . 'Brasilia' 13
     . 'San Luis Potosi' 12
     . 'Puebla' 12
     . 'Playa del Carmen' 12
     . 'Morelia' 12
     . 'Santa Fe de la Vera Cruz' 11
     . 'Chihuahua' 11
     . 'Campeche' 11
     . 'Cordoba' 10
     . 'Bahia Blanca' 10
     . 'San Luis' 10
7. Origin of data:

    a. No raw data or origin is stated in the files. No additional documentation or methods were provided.

8. Possible origin of data based on inspected data:

    a. Possibly the Google Places API, queried for '3D printing' or 'Impresion 3D' with a radius of 1000 m, for a specified list of locations.
    b. Other sources may have contributed data points, for example the provided file: ['Mapping_stats Brazil fazedores.xslx']

9. Observations:

    a. The data provided contains data points without references, collection methods or verified sources.

10. Quality:

    a. Evaluation results: Low, based on points 1, 2, 5 and 7.
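
The completeness figure in point 2 is an integer percentage of the 17 schema fields. As a worked check, with a hypothetical function name and the field count of 17 taken from the findings:

```python
def completeness_percent(present_columns, schema_field_count=17):
    """Integer percentage of schema fields covered by the dataset columns."""
    return 100 * len(present_columns) // schema_field_count


coverage = completeness_percent(["name", "geometry"])  # 100 * 2 // 17
```

Floor division rounds down, so two of 17 fields is reported as 11 % rather than 11.76 %.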

Recommendations:

1. Review the purposes of the Data Awards data collection and the OKW standard.
2. Provide the following resources:
    a. Data origin: raw data or references to the datasets used.
    b. Filtering and verification methods, whether computational or qualitative.
    c. Findings and analysis based on verifiable data.
    d. Collection methods.