in #hive2 years ago

user locations.png

I've turned your data into a chart. BTW, you should add an 'unknown' for users who have no location data at all, and 'uncategorized' for users who have location data but don't match your filter.

You should also post it to Hive Statistics community :-P

Sort:  

Here is the chart when we factor in users with no location data + users with uncategorized location data.

Locations of Hive Users Active in Last Month.png

This is the real insight IMO! Very very interesting :)

There are lots of countries I have not even checked. I can look at adding more, but it means researching what names might be used as there can be variations and people are not consistent in what they use. For the Brits I check UK, GB, England/Scotland/Wales/N. Ireland and variations on those. Even that does not find everyone. This is why we need local knowledge from people with the time to maintain their local list. We could coordinate to combine the data into charts like yours. It's no simple job.

It doesn't need to be comprehensive. By adding 'unknown' and 'uncategorized' you can start narrowing down what is missing. Can you send me your query and I'll make a few additions?

I am running SQL queries, so I only get what matches. I don't have a query for the non-matches. The code is here. You need a HiveSQL key to use it.

This query gives some useful results but they need cleaning up (lots of variations that mean the same thing)

select count(*) as count, JSON_VALUE(posting_json_metadata,'$.profile.location') as location
from Accounts a
WHERE a.posting_json_metadata like '%"location":%'
AND (a.last_vote_time >= '2022-05-23 00:00:00.000' or a.last_post >= '2022-05-23 00:00:00.000')
GROUP BY JSON_VALUE(posting_json_metadata,'$.profile.location')
ORDER BY count DESC

That will get a lot of results, but could make a list we could work from. Queries need to exclude false results, e.g. 'Northern Ireland' contains 'Ireland'. Various city names exist in multiple countries.

I just ran that query. We have 18 each on the Moon and Mars :) Lots of cities listed without a country.

Compiling a comprehensive list will involve a fair bit of manual work. I check each person I add for the Brit List. I have a query that looks at lots of locations, but excludes any I already have.

I'm doing some manual cleanup of the data as we speak. It won't be perfect, I will not check every account manually.

Have fun! As someone else said there are tools to process locations, but the data is really messy. I'll be interested to see a version that includes more of the active users. I know a lot in the USA got missed out. For the Brits I have some in my file who you might not find just with SQL.

You can get the total number of active accounts in the time period with this:

select count(*) as count
from Accounts a
WHERE (a.last_vote_time >= '2022-05-23 00:00:00.000' or a.last_post >= '2022-05-23 00:00:00.000')

And the number of the same with location data set with this:

select count(*) as count
from Accounts a
where a.posting_json_metadata like '%"location":%'
and (a.last_vote_time >= '2022-05-23 00:00:00.000' or a.last_post >= '2022-05-23 00:00:00.000')

That amounts to 25458 and 13804 respectively. The rest is just subtraction.

Worth noting that if you're not joining on anything, you don't need to alias Accounts or provide the a. on your Where statements since the only table it's referencing is Accounts and therefore the identifier is unique by default.