Using an HBD/Hive ticker history to improve a Machine-Learning-based auto-voter

in #hive-139531, last year

Introduction

I've been experimenting with a Machine Learning (ML) model to predict the rewards of a given post.

I suspected that using the total post HBD reward I retrieved from HiveSQL was not ideal. I came to think that the real driving force was the HP, and that I should therefore use the post's Hive reward rather than its HBD reward. My assumption was that variations in the HBD/Hive rate would affect the real "power" a post received.

I wanted to explore this path because the decisive feature of my ML model is what I called the "mean article payout": the average payout an author received on their posts during the previous month.

As you can see on this graph, it accounts for more than 60% of the prediction:

The main variables in my ML Model

The length of the post is the second most important factor, followed by the account's reputation.
The latter was to be expected, since reputation increases as you receive upvotes: a high reputation denotes many votes in the past. The "mean article payout" and the account reputation are therefore correlated variables, which is undesirable in ML models.
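As an illustration (this is not my actual pipeline; the data, scales, and column names below are all made up), here is how two correlated features end up sharing importance in a tree-based model:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500

# Synthetic stand-ins for the real features (hypothetical scales)
reputation = rng.normal(60.0, 10.0, n)
mean_payout = 0.5 * reputation + rng.normal(0.0, 5.0, n)  # partly driven by reputation
post_length = rng.uniform(200.0, 3000.0, n)
target = 0.8 * mean_payout + 0.001 * post_length + rng.normal(0.0, 2.0, n)

X = np.column_stack([mean_payout, post_length, reputation])
model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, target)

for name, imp in zip(["mean_payout", "post_length", "reputation"],
                     model.feature_importances_):
    print(f"{name}: {imp:.2f}")

# The two correlated features share predictive signal, so each individual
# importance score understates what the feature would score alone.
print("corr(mean_payout, reputation):",
      round(float(np.corrcoef(mean_payout, reputation)[0, 1]), 2))
```

This is one reason importance plots from correlated inputs should be read with caution: the split between "mean article payout" and "reputation" depends on how the trees happen to pick between them.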

Data Prep

After setting up my table with HBD/Hive values, I modified my Talend job (an ETL tool) to compute the Hive value for each post.

The Talend job is a crucial part of the data preparation in my model:

My voter architecture

This is my pipeline architecture, illustrating the role of the Talend job.

One last question remained: which HBD/Hive conversion rate should I use? The one from the day of writing, or the one from the day of reward?
Having no insight into this question, I simply ran some tests.
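To make the two options concrete, here is a minimal sketch of the conversion (the rates, dates, and column names are made up for the example):

```python
import pandas as pd

# Hypothetical daily HBD/Hive ticker
ticker = pd.DataFrame({
    "date": pd.to_datetime(["2023-05-01", "2023-05-08"]),
    "hbd_per_hive": [0.42, 0.38],
})

# One post, written on the 1st and paid out after the 7-day window
posts = pd.DataFrame({
    "written": pd.to_datetime(["2023-05-01"]),
    "paid_out": pd.to_datetime(["2023-05-08"]),
    "reward_hbd": [10.0],
})

# Join the ticker on each date column, then divide by the matching rate
for col, label in [("written", "hive_at_writing"), ("paid_out", "hive_at_payout")]:
    rates = ticker.rename(columns={"date": col, "hbd_per_hive": label})
    posts = posts.merge(rates, on=col)
    posts[label] = posts["reward_hbd"] / posts[label]

print(posts[["hive_at_writing", "hive_at_payout"]])
```

The same HBD reward maps to two different Hive amounts depending on which day's rate is applied, which is exactly what the tests below compare.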

Tests

I decided to test both, and here are the results:

Long story short:

my idea was bad!

I compared four situations:

  • The original data source: pure HBD for both the target (the post's final reward) and the average of the previous month's post rewards.
  • HBD converted to Hive using the day-of-writing rate for both values.
  • HBD converted to Hive using the day-of-reward rate for both values.
  • HBD converted to Hive using:
    • the day-of-reward rate for the average of the previous month's post rewards;
    • the day-of-writing rate for the post's final reward.

There was no ML training involved: I simply measured the correlation between the two variables on exactly the same dataset.
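The test itself boils down to a Pearson correlation measurement. A toy version of it (entirely synthetic data, and assuming the writing-day and payout-day rates move independently, which is a simplification) looks like this:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Synthetic rewards in HBD: the target and the "mean article payout" feature
target_hbd = rng.gamma(2.0, 5.0, n)
feature_hbd = 0.9 * target_hbd + rng.normal(0.0, 3.0, n)

# Independent daily HBD/Hive rates for the two dates involved
rate_writing = rng.uniform(0.3, 0.5, n)
rate_payout = rng.uniform(0.3, 0.5, n)
target_hive = target_hbd / rate_payout
feature_hive = feature_hbd / rate_writing

def corr(x, y):
    return float(np.corrcoef(x, y)[0, 1])

# In this setup, dividing each side by unrelated noise weakens the correlation
print("HBD :", round(corr(feature_hbd, target_hbd), 3))
print("Hive:", round(corr(feature_hive, target_hive), 3))
```

Under this independence assumption, converting both sides with rates from different days injects uncorrelated noise, which is one plausible reading of why the HBD-only variant came out on top.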

The results are clear: the original approach is the best. Converting to Hive brings no correlation improvement; all my tests performed worse than my original data source.

Thoughts

My reasoning was that if my most important variable is less correlated with the target, my model will produce worse results. That may not be exactly true: the conversion could carry other information that the ML model could exploit to achieve better predictions.

Still, I don't think it would justify the hassle of changing my whole pipeline.

Future

I may have an idea for using my HBD/Hive ticker in the future, but I have some research to do first, and I'll write a post about it if I'm convinced.

Concerning the ML model: it's working as expected. It's not very accurate, but it works.
The main path to improving the quality of the predictions is to find one or two more relevant variables. A good way to achieve this would be to isolate authors whose posts perform very differently from one another.
A "manual curation" of those posts could lead me to some interesting insights.
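For instance, one way to surface such authors (toy data and hypothetical column names) is to rank them by the relative spread of their payouts:

```python
import pandas as pd

posts = pd.DataFrame({
    "author": ["alice", "alice", "alice", "bob", "bob", "bob"],
    "payout": [1.0, 20.0, 2.0, 5.0, 5.5, 4.5],
})

stats = posts.groupby("author")["payout"].agg(["mean", "std"])
stats["cv"] = stats["std"] / stats["mean"]  # coefficient of variation

# Authors with a high CV have inconsistent payouts: candidates for manual review
print(stats.sort_values("cv", ascending=False))
```

Here "alice" would bubble to the top, since one of her posts earned ten times the others.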
From my recent posts, I realized that:

  • The posting community has a strong impact. At the moment, it's lost amongst all the tags in my model.
  • It seems that without the Leofinance tag, a post won't appear on the leofinance interface (at least I don't see them in the "My Posts" tab). This has a big impact on the number of viewers.
  • Some tags have a specific voter, with lots of trailing voters.
  • Maybe the number (or presence) of links and pictures influences what a voter considers a "good" or "bad" post.
  • Many pictures could also artificially inflate the post body size, and should be removed before measuring it.
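The last two points could share one preprocessing step: stripping markdown image tags before measuring the body length. A sketch (the regex and sample body are illustrative, not my pipeline's actual cleaning rule):

```python
import re

body = ("Intro text ![chart](https://example.com/a.png) "
        "more text ![pic](https://example.com/b.jpg)")

# Remove markdown image tags of the form ![alt](url)
clean = re.sub(r"!\[[^\]]*\]\([^)]*\)", "", body)

print(len(body), "->", len(clean))
```

The same pass could also count the removed tags, giving the "number of pictures" feature for free.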

Even if I don't really need to improve the prediction, those leads could be interesting to follow, just to build up my personal knowledge and understanding of ML models, and the Hive ecosystem as a whole!

Notes:

The introduction image was created with Bing Image Creator and the logos added with The Gimp.

This comment was made by a hive-archeology bot running under the control of @pibara
The goal of this comment is to act as a reward proxy for upvoting valuable timeless content on HIVE for which the one-week upvote window has closed.
The bot script is currently pre-beta.

Nothing is guaranteed on Hive. I guess your test confirmed that. Apart from automatic votes, I think there is a lot more people can do to get votes:

  • Engagement
  • Quality posts
  • Posting in Right Community
  • Using Correct Tags
  • Good Thumbnail and Title

Oops, I hope my posts don't fall short on all of your points!

I must admit that I don't have a single clue why this particular post performed so badly compared to the previous one.

I used a grammar and syntax app to check my English (for the first time), I made a special thumbnail picture, writing it took me a good amount of time, and the topic and tags are comparable.

Concerning engagement, I try... It's difficult because I'm not the kind of person who's comfortable talking about a subject I don't know, and I don't like small talk.

I try to follow the Dev community, but I'm more of a Data guy than a Dev guy, so many posts are out of my area of expertise.

You don't need to talk about what you don't know, and you're doing well by avoiding small talk, because it's fake. However, we have a lot of communities, and there are posts for everyone: travel, photography, general life, gaming, movies & TV shows, for example. I'm sure some of these topics are something you can talk about.

You're right, I probably miss interesting content here and there!

I'll look around to find more stuff I like!