New Column With Coordinates Using Geopy Pandas
Solution 1:
You can call apply and pass the function you want to execute on every row, like the following:
In [9]:
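from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="my_app")  # placeholder name; recent geopy versions require a user_agent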
df['city_coord'] = df['state_name'].apply(geolocator.geocode)
df
Out[9]:
city_name state_name county_name \
0 WASHINGTON DC DIST OF COLUMBIA
1 WASHINGTON DC DIST OF COLUMBIA
city_coord
0 (District of Columbia, United States of Americ...
1 (District of Columbia, United States of Americ...
You can then access the latitude and longitude attributes:
In [16]:
df['city_coord'] = df['city_coord'].apply(lambda x: (x.latitude, x.longitude))
df
Out[16]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
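One caveat: geocode returns None when it finds no match, and the lambda above would then raise an AttributeError. A minimal sketch of a defensive helper follows. FakeLocation and to_coords are hypothetical names used only for illustration, and the stand-in object lets the example run without a network call:

```python
from collections import namedtuple
import pandas as pd

# Hypothetical stand-in for a geopy Location; it only carries the two
# attributes that the answer reads.
FakeLocation = namedtuple("FakeLocation", ["latitude", "longitude"])

def to_coords(location):
    """Return (latitude, longitude), or None if the geocoder found nothing."""
    if location is None:
        return None
    return (location.latitude, location.longitude)

# One successful lookup and one failed lookup.
results = pd.Series([FakeLocation(38.8937154, -76.9877934586326), None])
coords = results.apply(to_coords)
```

With a real geolocator, the same helper could replace the lambda: df['city_coord'].apply(to_coords).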
Or do it in a one-liner by calling apply twice:
In [17]:
df['city_coord'] = df['state_name'].apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))
df
Out[17]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
Also, your attempt geolocator.geocode(lambda row: 'state_name' (row)) did nothing, which is why you have a column full of None values.
EDIT
@leb makes an interesting point here: if you have many duplicate values, it will be more performant to geocode each unique value once and then map the results back onto the column:
In [38]:
states = df['state_name'].unique()
d = dict(zip(states, pd.Series(states).apply(geolocator.geocode).apply(lambda x: (x.latitude, x.longitude))))
d
Out[38]:
{'DC': (38.8937154, -76.9877934586326)}
In [40]:
df['city_coord'] = df['state_name'].map(d)
df
Out[40]:
city_name state_name county_name city_coord
0 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
1 WASHINGTON DC DIST OF COLUMBIA (38.8937154, -76.9877934586326)
So the above gets all the unique values using unique, constructs a dict from them, and then calls map to perform the lookup and add the coords. This will be more efficient than trying to geocode row-wise.
Solution 2:
Upvote and accept @EdChum's answer; I just wanted to add to it. His method works perfectly, but from personal experience I'd like to share a few things:
When dealing with geocoding, if you have multiple city/state combinations that repeat, it's much faster to send only one of each to be geocoded and then replicate the result to the other rows.
This is very helpful for large data and can be done in two ways:
- Based on your data only: since the rows seem to be exact duplicates, and only if you want, drop the extra ones and geocode just one of them. This can be done using drop_duplicates.
- If you want to keep all your rows, groupby the city/state combination, apply geocoding to the first one by calling head(1), then duplicate the result to the remaining rows.
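The two options above can be combined in one rough sketch: geocode each distinct city/state pair once, then copy the result back to every row. Note that this swaps in drop_duplicates plus merge rather than groupby and head(1). fake_geocode is a hypothetical offline stand-in for geolocator.geocode, and the "city, state" query format is an assumption:

```python
import pandas as pd

def fake_geocode(query):
    """Hypothetical offline stand-in for geolocator.geocode; returns (lat, lon)."""
    lookup = {"WASHINGTON, DC": (38.8937154, -76.9877934586326)}
    return lookup.get(query)

df = pd.DataFrame({
    "city_name": ["WASHINGTON", "WASHINGTON"],
    "state_name": ["DC", "DC"],
})

# Geocode each distinct city/state pair only once.
pairs = df[["city_name", "state_name"]].drop_duplicates().copy()
pairs["city_coord"] = (pairs["city_name"] + ", " + pairs["state_name"]).apply(fake_geocode)

# Copy the coordinates back onto every matching row, keeping all rows.
df = df.merge(pairs, on=["city_name", "state_name"], how="left")
```

Here the geocoder is called once even though the frame has two rows; with real data the saving grows with the number of repeated pairs.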
The reason is that each call to Nominatim incurs a small latency, even if you are querying the same city/state several times in a row. That latency adds up as your data gets large, causing a huge delay in response and possible timeouts.
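Another way to avoid repeated round trips, not mentioned above, is to memoize the lookup so that a repeated query never reaches the service. This is only a sketch: slow_geocode is a hypothetical stand-in for the real network call, and the calls list exists purely to show how many requests would have been sent:

```python
from functools import lru_cache

calls = []  # records every query that reaches the stand-in service

def slow_geocode(query):
    """Hypothetical stand-in for a network call such as geolocator.geocode."""
    calls.append(query)
    return (38.8937154, -76.9877934586326) if query == "DC" else None

@lru_cache(maxsize=None)
def cached_geocode(query):
    # Repeated queries are answered from the in-memory cache.
    return slow_geocode(query)

results = [cached_geocode(q) for q in ["DC", "DC", "DC"]]
```

Three lookups are answered, but only the first one reaches slow_geocode.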
Again, this is all from personally dealing with it. Just keep it in mind for future use if it doesn't benefit you now.