Hello and good day, dear community! My name is Victor, and I am gifting user-requested software to help fellow users and make the world a better place.
The latest published program is MovieList-compn:
The backstory:
User compn asked for help on the DonationCoder forum. He said:
I have a list of filenames, and my friend has a list of filenames. we are trying to organize and sort and compare our lists of filenames. but the two lists are too different for most string comparing tools.
As part of the conversation, fuzzy string matching was recommended via the "FuzzyWuzzy" Python library.
We agreed that some form of fuzzy matching was a must, but since the MovieList program is written in C#, an alternative library had to be used.
Another point that came up was how to deal with foreign characters in the many movie titles to be compared. The solution was to convert such foreign-language characters to ASCII for internal comparison, while showing users the original titles afterward.
Example:
Woman and Gramophone Johannes Stjärne Nilsson & Ola Simonsson, 2000
Stjärne => Stjarne
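For the curious, here is a minimal sketch of that folding idea. It is written in Python with the standard library only, so it is not the program's actual C# code, and the function name fold_to_ascii is made up for illustration:

```python
import unicodedata


def fold_to_ascii(text: str) -> str:
    """Fold accented letters to their ASCII base letters (illustrative helper)."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    # Encoding with "ignore" drops the combining marks and any other
    # character that has no ASCII form.
    return decomposed.encode("ascii", "ignore").decode("ascii")


# Keep the original next to the folded form so it can be shown to the user later.
original = "Stjärne"
folded = fold_to_ascii(original)
print(original, "=>", folded)
```

Note that characters without an ASCII base letter are simply dropped by this approach, so both the folding and the deletion of leftover non-ASCII characters happen in one pass.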
Finally, to streamline comparison, function words had to be removed to allow for better fuzzy scoring, with less "noise".
These function words made it to the list:
a, an, the, this, that, these, those, my, your, their, our, some, many, few, all, and, but, or, so, because, although, in, of, on, with, by, at, over, under, he, she, it, they, we, you, me, him, her, is, am, are, was, were, has, have, had, can, could, may, might, shall, should, will, would, must, who, what, when, where, why, how
...With the following processing conditions:
- There must be two or more words in the original title.
- At least one word must remain after filtering is done.
This way, all movies with a single function word title (such as "it" or "her") would be skipped, while the others would be processed normally.
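The filtering rules above can be sketched roughly as follows. This is a Python illustration, not the program's C# code. The helper name strip_function_words is hypothetical, and I am assuming that a title failing either condition is simply left unchanged and that words are split on whitespace without any punctuation handling:

```python
# The function word list from the post.
FUNCTION_WORDS = frozenset("""
a an the this that these those my your their our some many few all
and but or so because although in of on with by at over under
he she it they we you me him her is am are was were has have had
can could may might shall should will would must
who what when where why how
""".split())


def strip_function_words(title: str) -> str:
    """Remove function words, subject to the two conditions (illustrative helper)."""
    words = title.split()
    # Condition 1: the original title must have two or more words.
    if len(words) < 2:
        return title
    kept = [word for word in words if word.lower() not in FUNCTION_WORDS]
    # Condition 2: at least one word must remain after filtering.
    if not kept:
        return title
    return " ".join(kept)


for sample in ("Woman and Gramophone", "It", "We Are Who We Are"):
    print(sample, "=>", strip_function_words(sample))
```

The first sample loses its "and", while the other two come back untouched because they fail one of the two conditions.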
Implemented Algorithm:
- Fold foreign characters to ASCII.
- Delete non-ASCII characters.
- Remove function words (while caching original titles).
- Apply direct title comparison or one of the fuzzy algorithms available in the library.
- Generate entries for the matches.txt and unmatched.txt files using the comparison results from the previous step.
- Create entries for the collisions.txt file when more than one movie file points to the same title in the resulting set.
- Save files to disk by retrieving the original/cached titles, as expected by the user.
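To make the flow of the steps above easier to picture, here is a simplified Python sketch. It only covers the direct comparison path, it leaves out function word removal and writing to disk, and it is not the program's C# code. The names normalize and compare_lists are hypothetical, and I am assuming that "unmatched" means entries of the first list with no counterpart in the second:

```python
import unicodedata
from collections import defaultdict


def normalize(title: str) -> str:
    """Fold to ASCII and lowercase, giving the key used for internal comparison."""
    folded = unicodedata.normalize("NFKD", title).encode("ascii", "ignore")
    return folded.decode("ascii").lower().strip()


def compare_lists(first, second):
    """Return (matches, unmatched, collisions), always using the original titles."""
    # Cache the second list's original titles under their normalized keys.
    second_index = defaultdict(list)
    for original in second:
        second_index[normalize(original)].append(original)

    matches, unmatched = [], []
    first_index = defaultdict(list)
    for original in first:
        key = normalize(original)
        first_index[key].append(original)
        if key in second_index:
            matches.append((original, second_index[key][0]))
        else:
            unmatched.append(original)

    # A collision is more than one entry pointing to the same normalized title.
    collisions = {key: items for key, items in first_index.items() if len(items) > 1}
    return matches, unmatched, collisions


matches, unmatched, collisions = compare_lists(
    ["Amélie", "Amelie", "Heat"], ["AMELIE", "Ronin"]
)
print(matches, unmatched, collisions)
```

Because the originals are cached next to their normalized keys, the output can always show the titles exactly as the user wrote them.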
Libraries:
MovieFileLibrary
The first and most obvious issue that compn faced was that extra movie information in the file names (such as video format, file extension and actor names) ended up being compared along with the title.
Thanks to this library, you can get the actual title and year of a movie out of oddly-named files, which assists both direct and fuzzy comparisons by feeding them clean information.
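To give a feel for the kind of cleanup involved, here is a deliberately naive Python sketch. It is not how MovieFileLibrary works internally and does not use its API. The function name parse_movie_file, the extension list and the regular expressions are all my own assumptions for illustration:

```python
import re

# Hypothetical, non-exhaustive list of video extensions.
EXTENSION = re.compile(r"\.(mkv|mp4|avi|mov|wmv)$", re.IGNORECASE)
# A four-digit year (19xx or 20xx) that sits between separators or brackets.
YEAR = re.compile(r"[(\[. _]((?:19|20)\d{2})(?=[)\]. _]|$)")


def parse_movie_file(name: str):
    """Return (title, year) guessed from a file name; year is None if not found."""
    stem = EXTENSION.sub("", name)
    found = YEAR.search(stem)
    if found:
        # Everything before the year is treated as the title.
        title_part, year = stem[:found.start()], int(found.group(1))
    else:
        title_part, year = stem, None
    # Turn dot and underscore separators into spaces and trim leftovers.
    title = re.sub(r"[._]+", " ", title_part).strip(" -([")
    return title, year


print(parse_movie_file("The.Matrix.1999.1080p.BluRay.x264.mkv"))
print(parse_movie_file("Woman and Gramophone (2000).avi"))
```

A sketch like this breaks easily (for example, when the title itself contains a year-like number after a separator), which is exactly why a dedicated library is the better choice.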
FuzzySharp
FuzzySharp is a C# implementation of Python's FuzzyWuzzy library, which was originally recommended in the thread. It is configurable by algorithm and cutoff value, allowing you to fine-tune the output for a higher percentage of correct matches.
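The idea of a score with a cutoff can be sketched as follows. Please note this example does not use FuzzySharp or FuzzyWuzzy: it swaps in Python's standard difflib.SequenceMatcher as a stand-in scorer, so the numbers will differ from the real library. The names similarity and best_match, and the default cutoff of 80, are assumptions for illustration:

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> int:
    """Score two strings from 0 to 100 using difflib (stand-in scorer)."""
    return round(100 * SequenceMatcher(None, a.lower(), b.lower()).ratio())


def best_match(title: str, candidates, cutoff: int = 80):
    """Return (candidate, score) for the best candidate, or (None, score) below the cutoff."""
    best, best_score = None, 0
    for candidate in candidates:
        score = similarity(title, candidate)
        if score > best_score:
            best, best_score = candidate, score
    if best_score >= cutoff:
        return best, best_score
    return None, best_score


print(best_match("Woman Gramophone", ["Woman and Gramophone", "Heat"]))
print(best_match("Alien", ["Heat", "Ronin"]))
```

Raising the cutoff gives fewer but safer matches, while lowering it catches more oddly-named entries at the cost of more false positives.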
Command Line Parser
This CLI parser simplifies the implementation of command options and automatically documents usage (awesome!). It enables short (single-character) parameters and longer ones with full words. The auto-generated "--help" command screen enhanced the program's usability big time. Also, the "--version" option reads from the program’s AssemblyInfo directly, ensuring it always reflects the latest version number without manual adjustments.
We will definitely be using this library again in our Paradisus project's future releases.
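For readers who have not used this style of parser, here is a rough analogue using Python's standard argparse module instead of the C# Command Line Parser library. The program name, the option names (-f/--first, -s/--second, -c/--cutoff) and the version constant are hypothetical and are not the real options of the MovieList CLI:

```python
import argparse

# Hypothetical version constant; the real program reads its AssemblyInfo instead.
DEMO_VERSION = "0.0.0"


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with short and long options plus --help and --version."""
    parser = argparse.ArgumentParser(
        prog="listcompare-demo",
        description="Compare two lists of movie file names (illustrative only).",
    )
    parser.add_argument("-f", "--first", required=True, help="path to the first list")
    parser.add_argument("-s", "--second", required=True, help="path to the second list")
    parser.add_argument("-c", "--cutoff", type=int, default=80, help="fuzzy score cutoff")
    # The --help screen is generated automatically; --version needs one line.
    parser.add_argument("--version", action="version", version="%(prog)s " + DEMO_VERSION)
    return parser


# Short and long forms can be mixed freely.
args = build_parser().parse_args(["-f", "mine.txt", "--second", "theirs.txt"])
print(args.first, args.second, args.cutoff)
```

The help text passed to each option is what ends up on the auto-generated usage screen, so documenting and implementing the options happen in the same place.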
Source code and GitHub downloads:
GUI:
This is the main request fulfilled.
Release (v0.2): https://github.com/paradisusis/movielist-compn/releases/tag/v0.2.0
Source code: https://github.com/paradisusis/movielist-compn
It also runs on Linux:
(Tested on Ubuntu 22.04 using Mono runtime)
CLI:
The command line version was created in a secondary repository to facilitate adding text/console functionality alongside the main graphical release. The algorithm remains identical for both versions, with the added benefit of being able to perform tasks like chaining multiple programs, using batch files, and other command-line-related feats.
Release (v0.1): https://github.com/paradisusis/movielist-compn-cli/releases/tag/v0.1.0
Source code: https://github.com/paradisusis/movielist-compn-cli
Links & closing words:
💬 User compn's topic on DonationCoder: comparing two big different lists of strings/filenames.
📚 Libraries: MovieFileLibrary - Nuget | GitHub, FuzzySharp - Nuget | GitHub, Command Line Parser - Nuget | GitHub.
━ Divider art by Gordon Johnson from Pixabay
🏠 Main GitHub account for our project: https://github.com/paradisusis/
All gifted software is released under the Creative Commons Zero v1.0 - Public Domain dedication license.
Enjoy this release, and feel free to share it with anyone who might benefit from the program’s functionality.
Remember: Free software is like a candle. The more it is shared, the more it freely adds to everyone’s life, helping to paint a more positive “big picture” here on our beloved Earth.
Cheers!
Vic