22 : Anonymized Social Profiles from Blockchain Transactions
Problem Statement
It is worth asking whether similarity in financial interests and asset management techniques can indicate whether two people will interact and become friends, much as common friends and shared social interests do. The aim of this project is to analyze people’s financial data, use it to infer their financial profiles (interests, tendencies, etc.), and study how they interact with others who have similar profiles. We hypothesize that people with similar profiles interact more with one another, and we therefore expect tight-knit groups or communities to exist. We use publicly available, anonymous data from the Ethereum blockchain and attempt to answer four research questions:
1. Are there organizations or communities that form around how individuals use the blockchain to trade or manage their assets?
2. Is it possible to tell whether a single person owns two wallet addresses?
3. Which characteristics have the most influence on predicting user similarity?
4. Can we create user profiles from transaction histories to define users’ financial behavior?
We attempt to address these questions in the following text.
Applications
Such a study would have wide-ranging applications including:
1. Prediction of economic events and trends
2. Friend recommendations for a finance-based social media platform
3. Identification of malicious users evading bans, for example in airdrops, snapshot voting, and wash trading (artificial floor raises for NFT prices)
4. Focused analysis: for instance, people wishing to study NFT users can focus only on users who transact in NFTs, as our analysis revealed minimal overlap between NFT and DeFi users
Related work
1. Defining user spectra to classify Ethereum users based on their behavior
They extract some features for users of three to four smart contracts and use these features to summarize user behavior during a particular time period. They build a user graph using the transactions and then extract features from it. They also compare the "spectrum" of a user to a class of users.
2. A Data Science Approach for Honeypot Detection in Ethereum
They present a data science detection approach based primarily on contract transaction behavior. They partition all the possible cases of fund movements between the contract creator, the contract, the transaction sender and other participants.
3. High Performance Classification Model to Identify Ransomware Payments for Heterogeneous Bitcoin Networks
They propose a high-performance predictive system that examines Bitcoin payment transactions to learn data patterns that can recognize and classify ransomware payments in heterogeneous Bitcoin networks. Their system uses two supervised machine learning methods to learn the distinguishing patterns in Bitcoin payment transactions, namely shallow neural networks (SNN) and optimizable decision trees (ODT).
Methodology pipeline
Data Collection
For our analysis we considered the users transacting with the top 100 DeFi and NFT platforms. To get the list of top platforms we used the rankings on DeFi Pulse, DefiLlama and DeFi Prime. These websites provide up-to-date analytics and rankings for DeFi protocols. After getting the rank list, we manually compiled a set of the token addresses of these platforms.
Once we had the token addresses of these platforms, we retrieved the list of transactions for each platform using calls to the Etherscan API, which provides free and consistent data on Ethereum transactions. These transactions list the wallet addresses of the users transacting with that platform. By recording these addresses we obtained a list of users for each platform.
Once we had the wallet addresses, we used the Etherscan API again to get information on each of these users, such as their transactions, wallet balance, and the amount of tokens purchased or sold.
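The collection step can be sketched roughly as follows. This is a minimal illustration and not the project's actual scraper: the endpoint and parameter names follow the Etherscan account `txlist` action as commonly documented and should be checked against the current Etherscan documentation, and the helper names `build_txlist_url` and `extract_users` are hypothetical. No network call is made here; the second function only parses a response that has already been fetched.

```python
from urllib.parse import urlencode

# Assumed endpoint for Etherscan's account API; verify against current docs.
ETHERSCAN_URL = "https://api.etherscan.io/api"


def build_txlist_url(address, api_key, page=1, offset=100):
    """Build the request URL for the normal-transaction list of one address."""
    params = {
        "module": "account",
        "action": "txlist",
        "address": address,
        "startblock": 0,
        "endblock": 99999999,
        "page": page,
        "offset": offset,
        "sort": "asc",
        "apikey": api_key,
    }
    return ETHERSCAN_URL + "?" + urlencode(params)


def extract_users(response, platform_address):
    """Collect the wallet addresses that sent transactions to the platform."""
    result = response.get("result")
    # On errors (for example rate limiting) "result" is a message string,
    # not a list of transactions, so there is nothing to extract.
    if not isinstance(result, list):
        return set()
    platform = platform_address.lower()
    users = set()
    for tx in result:
        sender = (tx.get("from") or "").lower()
        receiver = (tx.get("to") or "").lower()
        if receiver == platform and sender and sender != platform:
            users.add(sender)
    return users
```

Addresses are lowercased before comparison because the same address can appear with different letter casing across responses.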
Feature Extraction
We now need to extract the data relevant to our analysis from the data collected using API calls and convert it into a format that is convenient for our analysis. To do this, we create a feature vector for each user. Each feature vector contains relevant data on that particular user, such as:
1. The ETH balance of the user
2. A list of their token and NFT transfers
3. The platforms they have interacted with. This includes the number of times they interacted with each protocol and their average number of transactions per day on that protocol
4. Their total number of transactions and average transactions per day across all protocols
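A feature vector along these lines could be represented as below. This is a sketch under assumptions: the field names, the `platform` key on each transaction record, and the `days` observation window are all hypothetical choices for illustration, not the project's actual schema.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class UserFeatures:
    address: str
    eth_balance: float
    # Token and NFT transfers, kept as raw records.
    transfers: list = field(default_factory=list)
    # Platform name -> number of interactions.
    platform_counts: dict = field(default_factory=dict)
    # Platform name -> average transactions per day on that platform.
    platform_avg_per_day: dict = field(default_factory=dict)
    total_tx: int = 0
    avg_tx_per_day: float = 0.0


def build_features(address, eth_balance, transactions, days, transfers=None):
    """Summarize one user's transactions over an observation window of `days` days.

    `transactions` is a list of dicts with a (hypothetical) "platform" key.
    """
    if days <= 0:
        raise ValueError("days must be positive")
    counts = Counter(tx["platform"] for tx in transactions)
    total = sum(counts.values())
    return UserFeatures(
        address=address,
        eth_balance=eth_balance,
        transfers=list(transfers or []),
        platform_counts=dict(counts),
        platform_avg_per_day={p: c / days for p, c in counts.items()},
        total_tx=total,
        avg_tx_per_day=total / days,
    )
```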

Similarity Scoring
Once the feature vectors are created, we need to create a scheme for performing similarity checking between any two users. For this purpose, we define a scoring function for each scalar feature within the feature vector, as follows:

This function gives a value between 0 and 1 that acts as a measure of similarity between the two users in terms of that particular feature. The intuition behind the function is that if two users are highly similar, the difference between their values will be small and thus the function will return a higher similarity score.
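The exact scoring function is not reproduced in this text. One function with the properties described above (a value between 0 and 1 that grows as the difference shrinks) is a normalized absolute difference. The sketch below is an assumption for illustration, not necessarily the function used in the project, and it assumes non-negative feature values such as balances and transaction counts.

```python
def scalar_similarity(a, b):
    """Return a similarity in [0, 1] for two non-negative scalar feature values.

    Identical values give 1.0; the score falls toward 0.0 as the
    difference grows relative to the larger of the two values.
    """
    if a < 0 or b < 0:
        raise ValueError("feature values are assumed to be non-negative")
    largest = max(a, b)
    if largest == 0:
        # Both values are zero, so the users are identical on this feature.
        return 1.0
    return 1.0 - abs(a - b) / largest
```

Dividing by the larger value keeps the score scale-free, so a difference of 5 transactions matters more between users with 5 and 10 transactions than between users with 1000 and 1005.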
To get the overall similarity score between two users, we need to compare their feature vectors. To compare two vectors we take a weighted sum of the similarity score for each individual feature. 
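We scale the weights so that they add up to 1, so they act like probabilities and the weighted sum stays between 0 and 1. We then scale this weighted sum to 0-100 to express the final score as a percentage: score = 100 × (w_1·s_1 + w_2·s_2 + ... + w_n·s_n), where s_i is the similarity score for feature i and w_i is its weight.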
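The weighted combination described here can be sketched as follows. The feature names and weight values in any call are placeholders; the source does not state which weights were used.

```python
def overall_score(similarities, weights):
    """Combine per-feature similarities (each in [0, 1]) into a 0-100 score.

    `similarities` and `weights` are dicts keyed by feature name. The
    weights are rescaled to sum to 1 before the weighted sum is taken.
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive value")
    weighted = sum(
        (weights[name] / total_weight) * similarities[name] for name in weights
    )
    return 100.0 * weighted
```

Because the weights are normalized inside the function, callers can pass raw importance values (for example 1 and 3) rather than pre-scaled fractions.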
We use these scores to cluster users to retrieve communities in the user graph.
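The source does not say which clustering method was used. As one simple possibility, the sketch below links two users whenever their score reaches a threshold and returns the connected components of the resulting graph as communities. The threshold value and the `score` callable are assumptions for illustration.

```python
from itertools import combinations


def communities(users, score, threshold=70.0):
    """Group users into communities by thresholding pairwise scores.

    `users` is a list of user ids and `score(u, v)` returns a 0-100
    similarity. Users are merged with a union-find structure whenever
    their score is at least `threshold`.
    """
    parent = {u: u for u in users}

    def find(u):
        # Walk up to the root, shortening the path along the way.
        while parent[u] != u:
            parent[u] = parent[parent[u]]
            u = parent[u]
        return u

    for u, v in combinations(users, 2):
        if score(u, v) >= threshold:
            parent[find(u)] = find(v)

    groups = {}
    for u in users:
        groups.setdefault(find(u), set()).add(u)
    return list(groups.values())
```

This compares every pair of users, so the cost grows quadratically with the number of users, which is one reason clustering becomes slow on larger samples.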
Survey
Finally, we conducted a survey of DeFi users to fine-tune and verify our results and to get additional insights. Our questions concerned their interest in DeFi, NFTs and DAOs, which related communities they were part of on other social networks, whether they participated in certain types of transactions, and so on. We also asked more personal questions, such as whether they had met other users of these platforms or made friends in the community. The survey also yielded insights that weren’t directly apparent from the collected data. For example, every respondent said that they owned multiple wallet addresses.
Ablations
One of the most important aspects of any CSS project is the data. The transaction data used in our project was scraped from the open Etherscan API. As we could not afford a premium tier, our API calls were rate-limited, which caused significant delays even when collecting a small amount of data. Moreover, we sometimes faced issues with the API itself, as some calls returned errors without much information about the cause. We nevertheless continued scraping data from this site, as it offered the best call rates among free blockchain transaction APIs.
The similarity scoring algorithm is the backbone of our analyses. To cluster two addresses appropriately, we need to measure the similarity between them from their transactions. The initial similarity scoring algorithm that we came up with was unable to adequately predict similarity. As a result, the network graph that we ended up with did not make much sense: its nodes were either too close together or too far apart. After several days of thorough debugging, we figured out why the algorithm was returning erroneous similarity scores. We then revised the algorithm by including weighted scores, scores for users' NFT platforms, and so on. This fixed the issue, and the scores returned by the algorithm reflected similarity more accurately.
Another setback was that our survey had few responses. One of the main reasons for this is that blockchain culture has not yet caught on widely in India. As a direct result, even though we circulated the survey to all the colleagues, acquaintances, and friends we know, only a small percentage of them were qualified to fill it out.
Finally, the clustering approach did not work as we intended. The data we have is still insufficient to properly deduce anything from the clustering graphs. Moreover, visualizing the clusters takes a considerable amount of time even for the smallest amount of data. With more time and computational resources, we expect that we could have produced properly visualized clusters from our data.
Analysis

The data for the user category analysis was collected over a period of three days. The analysis showed that only 1-2% of all users (visualized in the pie chart below) transact in both categories, DeFi and NFTs. This seems to confirm our hypothesis that there are semi-rigid boundaries between user groups and that people can be clustered according to their social behavior on blockchains.
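The overlap figure can be computed from the two user sets roughly as follows. This is a sketch: it assumes each category is available as a set of wallet addresses and that the percentage is taken over all users seen in either category, which the source does not state explicitly.

```python
def overlap_percentage(defi_users, nft_users):
    """Percentage of all observed users who appear in both categories."""
    all_users = defi_users | nft_users
    if not all_users:
        return 0.0
    both = defi_users & nft_users
    return 100.0 * len(both) / len(all_users)
```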
Next we do a network analysis.

The above image is a network graph showing a group of NFT users, denoted in blue, and NFT platforms, denoted in red. Each of these users had at most 2 protocol transactions. Interesting insights to draw from it are:
1. The OpenSea node (red) has many incoming edges, showing its popularity: many users use the same protocol.
2. A small cluster contains 2 users who use the same 2 protocols, showing a closer match of interests.
3. Another pair of users is interested in multiple protocols but shares only a single protocol of common interest.
4. Finally, one user uses 2 protocols that no one else uses, making that user harder to categorize.
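The observations above correspond to simple queries on a bipartite user-platform graph. The sketch below assumes the graph is stored as a mapping from each user to the set of platforms they used; the function names are hypothetical.

```python
from collections import Counter
from itertools import combinations


def platform_in_degree(usage):
    """Count how many users point at each platform (its incoming edges)."""
    return Counter(p for platforms in usage.values() for p in platforms)


def shared_platforms(usage):
    """Map each user pair to the platforms they have in common.

    Pairs with no platform in common are omitted, so a user who shares
    nothing with anyone does not appear in any key.
    """
    shared = {}
    for u, v in combinations(sorted(usage), 2):
        common = usage[u] & usage[v]
        if common:
            shared[(u, v)] = common
    return shared
```

A platform with a high in-degree corresponds to a popular hub such as OpenSea, a pair sharing all of its platforms corresponds to the tightly matched cluster, and a user absent from every pair corresponds to the hard-to-categorize case.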

Moreover, the DEX analysis shown above indicates that many users stay loyal to the DEX they use. Almost 50% of Uniswap users and 60% of Quickswap users use only that particular DEX.
This is likely because Uniswap is the most popular DEX on Ethereum and Quickswap the most popular on Polygon.
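A loyalty figure of this kind can be computed as the share of a DEX's users who use no other DEX. The sketch below assumes a mapping from each user to the set of DEXes they have traded on; it illustrates the calculation and is not the project's code.

```python
def loyalty_share(usage, dex):
    """Percentage of users of `dex` who use that DEX exclusively."""
    users = [u for u, dexes in usage.items() if dex in dexes]
    if not users:
        return 0.0
    loyal = sum(1 for u in users if usage[u] == {dex})
    return 100.0 * loyal / len(users)
```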
We see that a very high similarity score is achieved for accounts owned by the same user. We know this because deeper manual analysis reveals an almost identical timeline of transactions and NFT/token activity, as well as transactions between the two addresses that circulate money. This can be misused in multiple ways, as mentioned earlier in the Applications section. It closely resembles the Wikipedia ban evasion paper by Prof. Srijan Kumar and others, as multiple wallet addresses show the characteristics of parent and child accounts.
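The manual check described here (a high similarity score plus money moving directly between the two addresses) could be turned into a simple flagging rule. The sketch below is a heuristic under assumptions: the thresholds are arbitrary placeholders, and the transaction records are assumed to carry `from` and `to` keys.

```python
def direct_transfers(transactions, a, b):
    """Count transactions sent directly between addresses a and b, either way."""
    pair = {a.lower(), b.lower()}
    count = 0
    for tx in transactions:
        sender = (tx.get("from") or "").lower()
        receiver = (tx.get("to") or "").lower()
        if sender != receiver and {sender, receiver} == pair:
            count += 1
    return count


def likely_same_owner(score, transactions, a, b,
                      score_threshold=90.0, min_transfers=1):
    """Flag a pair when similarity is high and funds circulate between them.

    Both thresholds are placeholder values, not figures from the study.
    """
    if score < score_threshold:
        return False
    return direct_transfers(transactions, a, b) >= min_transfers
```

A flagged pair is only a candidate for manual review; a high score with direct transfers can also occur between two distinct people who trade with each other.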
Conclusion
The major takeaway from this project is that there appear to be tight-knit communities in the blockchain space. We found this using different methods of analysis, categorical and network-based, as showcased in the previous sections. Moreover, a significant percentage of users stay loyal to the DEX they use. However, as mentioned earlier, our project had some limitations. These limitations did not severely obstruct our application or research questions, but they did constrain the scope of the project.
If these limitations were overcome, we could extend our project to further application-based systems. For instance, if the API issue were resolved, we would have a gold mine of transaction data to work with. We could use it to improve our analyses or even to build a model for predicting illegal or unethical activities. Such activities generally include a user making the same transactions through different wallet addresses to gain an unfair advantage. Furthermore, we could enhance our similarity scoring algorithm. As mentioned earlier, we had trouble with it early on and fixed it, but it can still be improved, especially if we scale up the project. The problems with the clustering approach discussed in earlier sections could also be resolved with more time and computational resources. Finally, graph mining algorithms could plausibly be used to predict the unethical activities mentioned above. Graph mining algorithms would help us identify and locate recurring structures within our social graph. The presence of such structures makes it easier to detect possible cases of blackmail, fraud or scams, ransom demands, etc.