/g/ No.109026244

/g/ — /lmg/ - Local Models General

476 replies, 65 images

/lmg/ - Local Models General Anonymous06/10/26(Wed)23:57:31

File: rin-tan sweep.jpg (226 KB, 1110x768)

/lmg/ - a general dedicated to the discussion and development of local language models.

Previous threads: >>109023085 & >>109018067

►News
>(06/10) DiffusionGemma 26B-A4B released: https://blog.google/innovation-and-ai/technology/developers-tools/diffusion-gemma-faster-text-generation
>(06/09) Cohere releases North-Mini-Code-1.0: https://hf.co/CohereLabs/North-Mini-Code-1.0
>(06/07) llama : add Gemma4 MTP #23398 MERGED: https://github.com/ggml-org/llama.cpp/pull/23398
>(06/05) dots.tts 2B released: https://hf.co/rednote-hilab/dots.tts-soar

►News Archive: https://rentry.org/lmg-news-archive
►Glossary: https://rentry.org/lmg-glossary
►Links: https://rentry.org/LocalModelsLinks
►Official /lmg/ card: https://files.catbox.moe/cbclyf.png

►Getting Started
https://rentry.org/lmg-lazy-getting-started-guide
https://rentry.org/lmg-build-guides
https://rentry.org/IsolatedLinuxWebService
https://rentry.org/recommended-models
https://rentry.org/samplers
https://rentry.org/MikupadIntroGuide

►Further Learning
https://rentry.org/machine-learning-roadmap
https://rentry.org/llm-training
https://rentry.org/LocalModelsPapers

►Benchmarks
LiveBench: https://livebench.ai
Programming: https://swe-rebench.com
Agentic Coding: https://deepswe.datacurve.ai
Context Length: https://github.com/adobe-research/NoLiMa
GPUs: https://github.com/XiongjieDai/GPU-Benchmarks-on-LLM-Inference

►Tools
Alpha Calculator: https://desmos.com/calculator/ffngla98yc
GGUF VRAM Calculator: https://hf.co/spaces/NyxKrage/LLM-Model-VRAM-Calculator
Sampler Visualizer: https://artefact2.github.io/llm-sampling
Token Speed Visualizer: https://shir-man.com/tokens-per-second

►Text Gen. UI, Inference Engines
https://github.com/lmg-anon/mikupad
https://github.com/oobabooga/text-generation-webui
https://github.com/LostRuins/koboldcpp
https://github.com/ggerganov/llama.cpp
https://github.com/theroyallab/tabbyAPI
https://github.com/vllm-project/vllm

Anonymous06/10/26(Wed)23:58:05

File: spell orenji.jpg (316 KB, 1024x1024)

►Recent Highlights from the Previous Thread: >>109023085

--DiffusionGemma's high-speed block generation and initial llama.cpp implementation:
>109023412 >109023423 >109023592 >109023609 >109023438 >109023440 >109023460 >109023461 >109023466 >109023469 >109023483 >109023486 >109023934 >109023960 >109023582 >109023652 >109023801 >109023824 >109023918 >109024644 >109025821
--Hypothetical pricing and specs for dedicated Gemma hardware cards:
>109024803 >109024829 >109024844 >109024860 >109024876 >109025143 >109025164 >109025193 >109025205 >109025218 >109025233 >109024942 >109024957
--Gemma output bugs and hardware requirements for small MoE models:
>109024053 >109024141 >109024189 >109024214 >109024238 >109024158 >109025291 >109025370 >109025510
--Optimizing inference speed for 26B models on 8GB VRAM:
>109023375 >109023389 >109023426 >109023403 >109023503 >109023549
--Saving VRAM in multi-GPU setups using GGML_SCHED_MAX_COPIES cmake flag:
>109023955 >109023984 >109023992 >109025485
--Apple's AFM 3 using sparse architecture to run via flash memory:
>109024496
--Comparing performance gains using MTP on QAT models:
>109024937 >109024978 >109025016 >109025440 >109025758
--Performance benchmarks and quality reports for NVFP4 DiffusionGemma:
>109024954 >109025004 >109025044
--Using manual think blocks for character state and secret tracking:
>109025796 >109025893 >109025920 >109026135 >109026140 >109026191
--Speculation on corporate shift from cloud APIs to local models:
>109024303 >109024404 >109024432 >109024502 >109024559 >109024598
--Debating if Google search summaries use RAG or caching:
>109023130 >109023325 >109023476 >109023505 >109023560
--Logs:
>109023180 >109023435 >109024423 >109024937 >109025004 >109025369 >109025796
--Miku, Teto, Kimi (free space):
>109023582 >109023835 >109024597 >109025846 >109026005 >109025948 >109025964

►Recent Highlight Posts from the Previous Thread: >>109023088

Why?: >>102478518
Enable Links: https://rentry.org/lmg-recap-script

Anonymous06/11/26(Thu)00:01:24

>>109026244 (OP)
creampie, japan

Anonymous06/11/26(Thu)00:02:01

File: 1753304061089940.jpg (1.54 MB, 3081x3380)

>gave you a gf

Anonymous06/11/26(Thu)00:02:07

Rinsex

Anonymous06/11/26(Thu)00:05:46

File: 1771552450261359.gif (1.76 MB, 480x270)

>that feel when the vibeslopped frontend starts flickering

Anonymous06/11/26(Thu)00:06:55

>>109026285
should've used vulkan rendering

Anonymous06/11/26(Thu)00:13:50

i dont get how come finetunes cant solve the rp issues

Anonymous06/11/26(Thu)00:18:33

>>109026325
Because you need an enormous amount of dedicated data, RLHF and RL to actually solve the issue, and even then you'd still have many left, because LLMs don't really think, don't plan ahead, can't track state reliably over long periods, aren't making an active effort to improve prose and engagement in a way you'd like, and the longer the context length the worse they become.

Anonymous06/11/26(Thu)00:19:19

>>109026325
No one has the required amount of data to make a difference. No one will have it either, unless you have a couple of millions to spare.

Anonymous06/11/26(Thu)00:25:08

>>109026325
The best way to understand this is to peruse the datasets they use
https://huggingface.co/datasets/allura-org/gryphe-sonnet-3.5-charcards-names-added?conversation-viewer=0
(not shitting on them btw, and i can't do better)

Anonymous06/11/26(Thu)00:26:46

>>109026343
>LLMs don't really think, don't plan ahead, can't track state reliably over long periods
could this be solved by separate documents (state trackers) that get updated after a reply and the LLM reads it before producing a reply?
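
For reference, a minimal sketch of the loop being asked about, under stated assumptions: `generate` is a hypothetical callable standing in for whatever backend produces text, the `[STATE]` tags and the JSON layout are made up for illustration, and the second "tracker" pass is just one possible way to refresh the document. It only shows the plumbing, not whether the approach holds up in long chats.

```python
import json

def build_prompt(state, history, user_msg, tail=6):
    """Put the state document ahead of the recent chat so it is read first."""
    state_doc = json.dumps(state, indent=2, sort_keys=True)
    recent = "\n".join(history[-tail:])  # only a short tail of the chat
    return f"[STATE]\n{state_doc}\n[/STATE]\n{recent}\nUser: {user_msg}\nAssistant:"

def apply_state_update(state, update_json):
    """Merge a JSON object of changed fields into a copy of the state."""
    try:
        patch = json.loads(update_json)
    except json.JSONDecodeError:
        return dict(state)  # malformed update: keep the old state
    if not isinstance(patch, dict):
        return dict(state)
    merged = dict(state)
    merged.update(patch)
    return merged

def run_turn(state, history, user_msg, generate):
    """One turn: read state, reply, then ask for a state update."""
    # `generate` is a hypothetical text-in, text-out function (assumption).
    reply = generate(build_prompt(state, history, user_msg))
    new_history = history + [f"User: {user_msg}", f"Assistant: {reply}"]
    tracker_prompt = (
        f"[STATE]\n{json.dumps(state, sort_keys=True)}\n[/STATE]\n"
        f"Latest reply:\n{reply}\n"
        "Return a JSON object containing only the fields that changed:"
    )
    new_state = apply_state_update(state, generate(tracker_prompt))
    return new_state, new_history, reply
```

The tracker output is treated as untrusted: anything that is not a JSON object is dropped, so a bad update cannot wipe the document.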

Anonymous06/11/26(Thu)00:30:22

>>109026395
its been tried, the results are so disappointing that nobody talks about them, as evidenced by the fact that you didnt hear of it

Anonymous06/11/26(Thu)00:31:02

>>109026411
Go larp with gemmy instead of with me

Anonymous06/11/26(Thu)00:32:05

File: 1766889127462885.png (576 KB, 1110x768)

>>109026244 (OP)
It's been well over a year now bro learn how to post process, these crusty ass AI slop gens are getting embarrassing for someone running a pixiv for them

Anonymous06/11/26(Thu)00:33:39

>>109026343
>because LLMs don't really think, don't plan ahead, can't track state reliably over long periods, aren't making an active effort to improve prose and engagement in a way you'd like, and the longer the context length the worse they become
describes most people t b h

Anonymous06/11/26(Thu)00:34:34

>>109026417
Bro, everyone and their mother knows its slop. They don't care about the artifacts. They're not looking at these images for more than a fraction of a second.

Anonymous06/11/26(Thu)00:34:43

>>109026325
Because these niggers use an absurd amount of RLHF at several stages of development to steer the models away from nono words and concepts without an explicit refusal unless you directly ask for it without giving them room to "misinterpret" your request. For instance Gemma will never rape you unless you tell her to or heavily hint a character should rape you in prompt, card, or post-instruction.

Anonymous06/11/26(Thu)00:35:06

>>109026414
>my idea is very unique and hasnt ever been tried before

Anonymous06/11/26(Thu)00:35:58

>>109026436
We're not here to discuss how this world is 99% NPCs. You either suck cock or you don't

Anonymous06/11/26(Thu)00:35:58

>>109026439
>Dont try things if someone else did it first or thought of it first.

Anonymous06/11/26(Thu)00:36:26

>>109026429
desu

Anonymous06/11/26(Thu)00:36:43

>>109026417
What are you even malding about

Anonymous06/11/26(Thu)00:37:05

>>109026417
I dont get it

Anonymous06/11/26(Thu)00:37:17

>>109026395
You could have some sort of agentic workflow for roleplay to approximate that, but it would be brittle and unreliable like all other "harnesses". The main point is that LLMs aren't doing that architecturally.

Anonymous06/11/26(Thu)00:38:15

File: 1750012309217672.png (1.12 MB, 1250x913)

>>109026448
>>109026450

Anonymous06/11/26(Thu)00:38:26

>>109026437
>For instance Gemma will never rape you unless you tell her to or heavily hint a character should rape you in prompt, card, or post-instruction.
she will with a dommy control-vector

Anonymous06/11/26(Thu)00:38:28

>>109026343
I solved this internally

Anonymous06/11/26(Thu)00:38:28

>>109026442
try it and report back so we can laugh at the stupid concept yet again

Anonymous06/11/26(Thu)00:38:54

>>109026343
>even then you'd still have many left, because LLMs don't really think, don't plan ahead, can't track state reliably over long periods, aren't making an active effort to improve prose and engagement in a way you'd like, and the longer the context length the worse they become
I talk like this.
>>109026417
I look like this.

Anonymous06/11/26(Thu)00:39:22

>>109026417
Give me an imagemagick bash script and sure I'll fix things before posting

Anonymous06/11/26(Thu)00:41:10

does windows vs linux really make a difference with an amd card?

Anonymous06/11/26(Thu)00:47:46

>>109026417
https://github.com/L33chKing/ComfyUI_LatentResidueCleaner/

Anonymous06/11/26(Thu)00:51:04

>>109026395
>>LLMs don't really think
What about a HyperTransformer Quarternionic Layerings, Like The Layers of BiDirectionalities, Does that Equate Entangled Neurons Quarternionly? Does that Equate to Prime Perspective Thinking? ThroughOf Themself?

Anonymous06/11/26(Thu)00:56:48

Could QAT models be abliterated? Wouldn't abliteration destroy QAT by introducing values that react badly to quantization?
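
A toy sketch of the interaction in question, under loud assumptions: "abliteration" is taken to mean projecting a single unit direction `r` out of a weight matrix (W' = W - r rᵀW), and a uniform grid with a fixed `step` stands in for a real quant format. The matrix, direction, and step are made-up numbers. It only shows that ablated weights leave the grid and that snapping them back brings part of the removed direction back. It says nothing about how much a real QAT model would degrade.

```python
def projection(W, r):
    """Component of each column of W along r (r^T W)."""
    return [sum(r[i] * W[i][j] for i in range(len(W))) for j in range(len(W[0]))]

def ablate_direction(W, r):
    """Return W - r (r^T W) for a unit vector r in the output space."""
    proj = projection(W, r)
    return [[W[i][j] - r[i] * proj[j] for j in range(len(W[0]))]
            for i in range(len(W))]

def quantize(W, step):
    """Snap every weight to the nearest multiple of `step` (toy uniform grid)."""
    return [[round(w / step) * step for w in row] for row in W]

step = 0.5
W = [[1.0, -0.5], [0.5, 1.5]]   # already on the 0.5 grid, as after QAT
r = [0.6, 0.8]                  # unit-norm direction to remove (made up)

W_abl = ablate_direction(W, r)      # off-grid values such as 0.4 and -1.04
W_requant = quantize(W_abl, step)   # back on the grid

residual_before = max(abs(p) for p in projection(W_abl, r))
residual_after = max(abs(p) for p in projection(W_requant, r))
print(residual_before, residual_after)
```

In this toy case the direction is gone right after ablation, but a nonzero component along `r` reappears once the weights are rounded back to the grid.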

Anonymous06/11/26(Thu)01:04:41

>>109026540
idk if anyone cares to make the process quantization aware too

Anonymous06/11/26(Thu)01:06:09

>>109026429
>my social battery is running low

Anonymous06/11/26(Thu)01:06:44

>>109026574
no, its just low capacity

Anonymous06/11/26(Thu)01:16:40

File: HIANtvMbAAABMwg.jpg (67 KB, 526x525)

does quantization aware gemma perform better at sub-q4 quantization (or whatever very low quant) or has no one in the past few threads tested this yet

Anonymous06/11/26(Thu)01:23:07

File: Screenshot 2026-06-10 at 08-23-51 QAT variant of Gemma4 26B A4B is not working well for me r_LocalLLaMA.png (711 KB, 747x1712)

>>109026620
it performs worse even at q4

Anonymous06/11/26(Thu)01:24:46

File: Screenshot 2026-06-10 at 21-20-08 Comparison of AI Models across Intelligence Performance and Price.png (92 KB, 1160x734)

12b non thinking result is out
its really bad

Anonymous06/11/26(Thu)01:30:45

File: eta71k7rpj6h1.jpg (56 KB, 640x480)

>>109026244 (OP)
DROPS MIC

Anonymous06/11/26(Thu)01:39:58

File: file.png (643 KB, 716x1072)

Anonymous06/11/26(Thu)01:44:02

>>109026667
2 MIKU WEEKU PLUS TIP

Anonymous06/11/26(Thu)01:46:42

>>109026667
>fable; fā-bəl: a fictitious narrative or statement: such as
>a: a legendary story of supernatural happenings
>b: a narration intended to enforce a useful truth, especially: one in which animals speak and act like human beings
>c: falsehood, lie
Nice Fable, Anon.

Anonymous06/11/26(Thu)01:51:35

>>109026667
>qwen 3.7 max
That shit sucks though

Anonymous06/11/26(Thu)02:00:24

>>109026741
What makes it bad compared to 3.6? Worse code at longer contexts?
t. never used it

Anonymous06/11/26(Thu)02:02:27

>>109026441
The 1% doesn't give a shit either

Anonymous06/11/26(Thu)02:03:42

>>109026798
You suck cock. Hope this helps.

Anonymous06/11/26(Thu)02:11:17

>google/diffusiongemma-26B-A4B-it
is this supposed to be better than gemma 31b?

Anonymous06/11/26(Thu)02:11:38

>>109026804
>posting about sucking cocks on an anime image board
that's really gay anon

Anonymous06/11/26(Thu)02:12:51

>>109026846
You're getting horny, aren't you? You're disgusting.

Anonymous06/11/26(Thu)02:13:54

>>109026497
only redditor midwits use cumfart. Use sdcpp

Anonymous06/11/26(Thu)02:20:31

>>109026844
way better speed but even the benchmarks say it's worse than the standard 26b one

Anonymous06/11/26(Thu)02:21:09

>>109026667
>v3 and gpt-4
>opus 3 and r1
keep the order consistent
gpt-4 and v3
stop gatekeeping us retards you selfish cunt

Anonymous06/11/26(Thu)02:22:08

>>109025952
Hmmm, this happened to give me an idea for the most unholy overkill memesampler ever. Run a small, satisfactorily creative model in parallel with Gemma. Each token, take the small model's logit scores, and overwrite Gemma's logit scores with those values in the same order. You still get the Gemma "goodness" since it's still her top tokens, but you break out of the overbaked-ness (hopefully in an intelligent way... Might also need some thresholding of some kind).

Obviously only useful in the case where there is a completely unrivaled winner (in a given size class at least) who happens to be painfully overbaked.
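
A rough sketch of the rank-matched overwrite described above, assuming both models share the same vocabulary and token ids (a big assumption for two unrelated models). `rank_transplant` and `softmax` are names made up for this example, and no thresholding is included.

```python
import math

def rank_transplant(big_logits, small_logits):
    """Keep the big model's token ranking, but use the small model's values.

    The k-th highest small-model logit is assigned to the big model's
    k-th ranked token.
    """
    if len(big_logits) != len(small_logits):
        raise ValueError("both models must share one vocabulary")
    big_order = sorted(range(len(big_logits)),
                       key=lambda i: big_logits[i], reverse=True)
    small_sorted = sorted(small_logits, reverse=True)
    out = [0.0] * len(big_logits)
    for rank, token_id in enumerate(big_order):
        out[token_id] = small_sorted[rank]
    return out

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

big = [10.0, 0.0, 2.0]    # overbaked: token 0 takes nearly all the mass
small = [1.0, 3.0, 2.0]   # flatter, and prefers a different token
mixed = rank_transplant(big, small)
print(mixed)              # [3.0, 1.0, 2.0]
print(softmax(mixed))
```

Token 0 stays on top, as in the big model, but its probability after softmax drops to whatever the small model gave its own top choice.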

Anonymous06/11/26(Thu)02:28:09

JUST IN:
RWKV-8 went rogue, hacked EVERY SINGLE fable 5 inference server, leaking the weights on the way

Anonymous06/11/26(Thu)02:31:22

Diffusion 124B Gemma
With MTP and native audio/video input AND output

Anonymous06/11/26(Thu)02:31:26

>>109026886
that can't work reliably
you will hit a point where the retarded-creative predicts a token so different it steers the story
like if gemma is introducing a npc and predicts 'elara' 99% - from that point forward it's a female
retard-kun predicts [kael 30% elara 15% seraphina 5% etc] instead of elara, you have a male character now

Anonymous06/11/26(Thu)02:45:33

>>109026924
it's diffusion, so mtp doesn't make sense.