
New Column With Coordinates Using Geopy Pandas

I have a df:

import pandas as pd
import numpy as np
import datetime as DT
import hmac
from geopy.geocoders import Nominatim
from geopy.distance import vincenty

df

city_name

Solution 1:

You can call apply and pass the function you want to execute on every row like the following:

In [9]:

geolocator = Nominatim(user_agent="my-app")  # newer geopy versions require a user_agent
df['city_coord'] = df['state_name'].apply(geolocator.geocode)
df
Out[9]:
    city_name state_name       county_name  \
0  WASHINGTON         DC  DIST OF COLUMBIA   
1  WASHINGTON         DC  DIST OF COLUMBIA   

                                          city_coord  
0  (District of Columbia, United States of Americ...  
1  (District of Columbia, United States of Americ...  

You can then access the latitude and longitude attributes:

In [16]:

df['city_coord'] = df['city_coord'].apply(lambda x: (x.latitude, x.longitude))
df
Out[16]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)

Or do it in a one liner by calling apply twice:

In [17]:
df['city_coord'] = df['state_name'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df

Out[17]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)

Also, your attempt geolocator.geocode(lambda row: 'state_name' (row)) passed a lambda object to geocode instead of calling it on each row, so it did nothing, which is why you ended up with a column full of None values.
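Note also that geocode returns None for queries Nominatim cannot resolve, so the bare (x.latitude, x.longitude) lambda will raise AttributeError if any lookup fails. A guarded version, sketched with a hypothetical stand-in for the geopy Location object so it runs offline:

```python
import pandas as pd

class Loc:
    # Minimal stand-in for a geopy Location result (illustrative only).
    def __init__(self, latitude, longitude):
        self.latitude = latitude
        self.longitude = longitude

# One resolved lookup and one failed one (geocode would return None).
s = pd.Series([Loc(38.8937154, -76.9877934586326), None])

# Guard against None before extracting the coordinates:
coords = s.apply(lambda x: (x.latitude, x.longitude) if x is not None else None)
```

Failed lookups stay as None in the resulting column instead of crashing the apply.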

EDIT

@leb makes an interesting point here: if you have many duplicate values, it is more performant to geocode each unique value once and then map the results back:

In [38]:
states = df['state_name'].unique()
d = dict(zip(states, pd.Series(states).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
d

Out[38]:
{'DC': (38.8937154, -76.9877934586326)}

In [40]:    
df['city_coord'] = df['state_name'].map(d)
df

Out[40]:
    city_name state_name       county_name                       city_coord
0  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)
1  WASHINGTON         DC  DIST OF COLUMBIA  (38.8937154, -76.9877934586326)

So the above gets all the unique values using unique, builds a dict from them, and then calls map to look up and add the coordinates. This is more efficient than geocoding row-wise.

Solution 2:

Upvote and accept @EdChum's answer; I just want to add to it. His method works perfectly, but from personal experience I'd like to share a few things:

When dealing with geocoding, if you have multiple city/state combinations that repeat, it's much faster to geocode only one of them and then copy the result to the other rows.

This is very helpful for large datasets and can be done in two ways:

  1. Since your rows appear to be exact duplicates, and only if you're willing to lose them, drop the extras and geocode just one of each. This can be done with drop_duplicates.
  2. If you want to keep all your rows, group by the city/state combination, geocode only the first row of each group (e.g. via head(1)), then copy the result to the remaining rows.
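Both approaches can be sketched with plain pandas; the fake_geocode function below is a hypothetical stand-in for the real geolocator.geocode call so the example runs offline:

```python
import pandas as pd

# Hypothetical stand-in for geolocator.geocode (illustrative only).
def fake_geocode(state):
    return {"DC": (38.8937154, -76.9877934586326)}.get(state)

df = pd.DataFrame({
    "city_name": ["WASHINGTON", "WASHINGTON"],
    "state_name": ["DC", "DC"],
})

# Approach 1: drop exact duplicates, geocode only the survivors.
deduped = df.drop_duplicates(subset=["city_name", "state_name"]).copy()
deduped["city_coord"] = deduped["state_name"].apply(fake_geocode)

# Approach 2: keep all rows -- geocode one representative per group,
# then broadcast the result back to the duplicates with a merge.
firsts = df.groupby(["city_name", "state_name"], as_index=False).head(1).copy()
firsts["city_coord"] = firsts["state_name"].apply(fake_geocode)
df = df.merge(firsts[["city_name", "state_name", "city_coord"]],
              on=["city_name", "state_name"], how="left")
```

Either way, the geocoder is called once per unique combination rather than once per row.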

The reason is that each call to Nominatim incurs a small latency, even when you query the same city/state repeatedly. That latency adds up as your data grows, causing long delays in response and possible timeouts.

Again, this is all from personally dealing with it. Keep it in mind for future use even if it doesn't benefit you now.
