12

PSA: I was feeding my AI training data all wrong for 6 months

I spent half a year dumping raw text files into my model, thinking more data was better. Then last Tuesday I ran a side-by-side test against cleaned, labeled data from the same sources. The cleaned set was only 20% of the size but produced 3x better results on validation. Turns out all that extra junk was just noise. Has anyone else hit a point where they realized they were sabotaging their own model?
3 comments

gavinb97 · 21d ago
My buddy runs a small label printing shop and he had a similar wake-up call with his inventory system. He kept scanning every single product into his database, thinking it would help him track everything better. After months of slow searches and constant errors, he finally realized he was scanning the same boxes multiple times and even scanning empty pallets. Once he cleaned it up and only tracked what actually moved, his system ran smoothly and he found he was overstocked on about 40 percent of his stuff. Sometimes we think piling on more stuff will fix things, but it just buries the useful parts.
4
ray_miller84 · 21d ago · Top Commenter
Yeah I was guilty of that too honestly... I used to just throw everything I could find at it thinking more is always better. But after seeing results like that I started actually looking at what I was feeding it and it made a huge difference.
4
drew55 · 21d ago
@gavinb97 your buddy's story hits pretty close to home. It sounds like he was basically doing what I was doing, but with boxes instead of text files. Makes you wonder how many people are out there just piling junk into their systems and calling it progress. Clean data is like having a tidy workshop: you can actually find the tool you need instead of digging through a mountain of scrap. I had to learn the hard way too, spending weeks stripping out duplicates and fixing formatting. Now I feel like a fool for all those months I wasted thinking more automatically meant better.
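
For anyone curious what "stripping out duplicates and fixing formatting" can look like, here is a minimal sketch, assuming the raw data is just a list of plain-text strings. The names `normalize` and `clean_corpus` are made up for illustration and are not from any particular library. It only catches exact duplicates after normalization, so near-duplicates would need something fancier.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Normalize Unicode form, collapse whitespace runs, and trim the ends."""
    text = unicodedata.normalize("NFKC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_corpus(docs):
    """Return normalized docs with empty entries and exact duplicates removed.

    Duplicates are compared case-insensitively after normalization. The first
    occurrence is kept, so the original order is preserved.
    """
    seen = set()
    cleaned = []
    for doc in docs:
        norm = normalize(doc)
        if not norm:
            continue  # skip entries that are empty or whitespace-only
        # Hash the case-folded text so the seen-set stays small for long docs.
        key = hashlib.sha256(norm.casefold().encode("utf-8")).hexdigest()
        if key in seen:
            continue  # already have this document
        seen.add(key)
        cleaned.append(norm)
    return cleaned


raw = ["Hello   world", "hello world\n", "", "Other text"]
cleaned = clean_corpus(raw)
print(f"kept {len(cleaned)} of {len(raw)} documents")
```

Comparing validation results before and after a pass like this is a cheap way to check whether the extra volume was helping or just adding noise.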
0