DBSCAN Clustering


Assume the CSV file uses ‘lat’ and ‘lon’ as the column headers for the latitude and longitude data.

– import the modules

– define the number of kilometers in one radian

– load the data set

– represent points consistently as (lat, lon)
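
A minimal sketch of the two steps above, assuming pandas is used for loading. The inline sample rows and the `city` column are hypothetical stand-ins for the real CSV file; only the ‘lat’ and ‘lon’ headers come from the description.

```python
import io

import pandas as pd

# Hypothetical sample standing in for the real CSV file; only the
# 'lat' and 'lon' headers are taken from the text.
sample_csv = io.StringIO(
    "lat,lon,city\n"
    "51.481292,-0.451011,London\n"
    "51.481200,-0.451100,London\n"
    "38.781775,-9.137544,Lisbon\n"
)

# With a real file this would be pd.read_csv("your_file.csv").
df = pd.read_csv(sample_csv)

# Keep every point in (lat, lon) order as an (n, 2) NumPy array.
coords = df[["lat", "lon"]].to_numpy()
print(coords.shape)
```

Keeping the (lat, lon) order fixed matters later, because the haversine metric expects latitude first.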

– define epsilon as 1.5 kilometers, converted to radians for use by haversine
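
The conversion is a single division: the number of kilometers in one radian is the Earth's mean radius, so a distance in kilometers divided by that radius gives an angle in radians. A small sketch, where the variable names are only illustrative:

```python
# Kilometers in one radian = mean radius of the Earth in kilometers.
kms_per_radian = 6371.0088

# 1.5 km expressed as an angle in radians, the unit the haversine metric works in.
epsilon = 1.5 / kms_per_radian
print(epsilon)
```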

– get the number of clusters
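
A sketch of the clustering step with scikit-learn's `DBSCAN`. The sample points are hypothetical, and `min_samples=1` and `algorithm="ball_tree"` are assumptions (with `min_samples=1` every point ends up in some cluster, so nothing is labelled as noise). The haversine metric expects (lat, lon) in radians, hence `np.radians`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

kms_per_radian = 6371.0088
epsilon = 1.5 / kms_per_radian  # 1.5 km in radians

# Hypothetical (lat, lon) points in degrees: two pairs of close points
# and one isolated point.
coords = np.array([
    [51.4813, -0.4510],
    [51.4815, -0.4512],
    [38.7818, -9.1375],
    [38.7820, -9.1377],
    [41.3800, 2.1700],
])

# min_samples=1 and ball_tree are assumptions, not taken from the text.
db = DBSCAN(eps=epsilon, min_samples=1, algorithm="ball_tree", metric="haversine")
db.fit(np.radians(coords))
cluster_labels = db.labels_

# Label -1 marks noise; exclude it from the count if it appears.
num_clusters = len(set(cluster_labels)) - (1 if -1 in cluster_labels else 0)
print("Number of clusters:", num_clusters)
```

With these sample points the two close pairs each form a cluster and the isolated point forms its own, giving three clusters.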

– all done, print the outcome

– find the point in each cluster that is closest to its centroid
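
One way to sketch this step with NumPy only. The helper names `haversine_km` and `get_centermost_point` are illustrative, and the centroid is assumed to be the plain arithmetic mean of the coordinates, which is a reasonable approximation for clusters only a few kilometers wide.

```python
import numpy as np

EARTH_RADIUS_KM = 6371.0088


def haversine_km(p, q):
    """Great-circle distance in km between two (lat, lon) points given in degrees."""
    lat1, lon1, lat2, lon2 = np.radians([p[0], p[1], q[0], q[1]])
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * np.arcsin(np.sqrt(a))


def get_centermost_point(cluster):
    """Return the (lat, lon) member of the cluster that lies nearest to its centroid."""
    cluster = np.asarray(cluster, dtype=float)
    centroid = cluster.mean(axis=0)  # assumed: arithmetic mean of lat and lon
    nearest = min(cluster, key=lambda point: haversine_km(point, centroid))
    return (float(nearest[0]), float(nearest[1]))


# Hypothetical cluster of three nearby points.
cluster = [(51.4810, -0.4510), (51.4815, -0.4512), (51.4830, -0.4520)]
print(get_centermost_point(cluster))
```

Returning an actual member of the cluster, rather than the centroid itself, means the representative point is always a real row of the data set.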

– unzip the list of centermost (lat, lon) tuples into separate lat and lon lists

– from these lats/lons, create a new df with one representative point for each cluster
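
The unzip and the new dataframe can be sketched as follows. The list `centermost_points` holds hypothetical values, and `rep_points` is just an illustrative name.

```python
import pandas as pd

# Hypothetical output of the previous step: one (lat, lon) tuple per cluster.
centermost_points = [(51.4815, -0.4512), (38.7818, -9.1375), (41.38, 2.17)]

# zip(*...) transposes the list of pairs into one tuple of lats and one of lons.
lats, lons = zip(*centermost_points)

# One row per cluster, holding its representative point.
rep_points = pd.DataFrame({"lat": lats, "lon": lons})
print(rep_points)
```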

– pull the row from the original data set whose lat/lon match those of each representative point; that way we get the full details, like city, country, and date, from the original dataframe
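
A sketch of the lookup using a pandas merge on the two coordinate columns, which is one possible way to do the matching and not necessarily how it was done originally. All values below are hypothetical.

```python
import pandas as pd

# Hypothetical original data set with the extra detail columns.
df = pd.DataFrame({
    "lat": [51.4813, 51.4815, 38.7818],
    "lon": [-0.4510, -0.4512, -9.1375],
    "city": ["London", "London", "Lisbon"],
    "country": ["UK", "UK", "Portugal"],
    "date": ["2014-05-14", "2014-05-14", "2014-05-20"],
})

# Hypothetical representative points, one per cluster.
rep_points = pd.DataFrame({"lat": [51.4815, 38.7818], "lon": [-0.4512, -9.1375]})

# Match on exact lat/lon values; if several original rows share the same
# coordinates, keep only the first so there is one row per cluster.
reduced = (
    rep_points.merge(df, on=["lat", "lon"], how="left")
    .drop_duplicates(subset=["lat", "lon"])
    .reset_index(drop=True)
)
print(reduced)
```

The match relies on exact float equality, so both frames need coordinates stored to the same precision.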

– keep 6 digits after the decimal point for latitude and longitude
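
A sketch of the rounding, assuming it is applied to the ‘lat’ and ‘lon’ columns with pandas; the sample values are hypothetical. Six decimal places of a degree correspond to roughly 0.11 m at the equator, so nothing meaningful is lost.

```python
import pandas as pd

# Hypothetical coordinates carrying more digits than needed.
df = pd.DataFrame({
    "lat": [51.48129234, 38.78177512],
    "lon": [-0.45101099, -9.13754441],
})

# Round both coordinate columns to 6 digits after the decimal point.
df[["lat", "lon"]] = df[["lat", "lon"]].round(6)
print(df)
```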

– plot the final reduced set of coordinate points vs the original full set
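
A plotting sketch with matplotlib. The data, colors, marker sizes, and labels are all assumptions chosen for illustration: the full set is drawn as small dark dots and the reduced set as larger translucent markers on top.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical full data set and its reduced version.
df = pd.DataFrame({
    "lat": [51.4813, 51.4815, 38.7818, 38.7820, 41.3800],
    "lon": [-0.4510, -0.4512, -9.1375, -9.1377, 2.1700],
})
reduced = pd.DataFrame({
    "lat": [51.4815, 38.7818, 41.3800],
    "lon": [-0.4512, -9.1375, 2.1700],
})

fig, ax = plt.subplots(figsize=(10, 6))

# Longitude on the x-axis, latitude on the y-axis.
full_scatter = ax.scatter(df["lon"], df["lat"], c="k", alpha=0.9, s=3)
reduced_scatter = ax.scatter(reduced["lon"], reduced["lat"], c="#99cc99",
                             edgecolor="none", alpha=0.7, s=120)

ax.set_title("Full data set vs DBSCAN reduced set")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.legend([full_scatter, reduced_scatter], ["Full set", "Reduced set"],
          loc="upper right")

# Call plt.show() or fig.savefig("some_name.png") to view the result.
```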

Thanks to:
https://github.com/gboeing/2014-summer-travels/blob/master/clustering-scikitlearn.ipynb

Clustering to Reduce Spatial Data Set Size